Re: [PATCH GCC]A simple implementation of loop interchange

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: "Bin.Cheng" <amker.cheng@gmail.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Michael Matz <matz@suse.de>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH GCC]A simple implementation of loop interchange
Date: Thu, 23 Nov 2017 17:01:00 -0000	[thread overview]
Message-ID: <CAHFci2_7NW8eAkCEqPYYXULG3j1ROYbnroHgD5TaQxSZE+R9yg@mail.gmail.com> (raw)
In-Reply-To: <CAFiYyc2FfkmHZGyiaWBJgTHRydDyrL9vNGvadd-+MAh4OuYgFg@mail.gmail.com>

Hi Richard,
Thanks for reviewing.  It's quite lot comment, I am trying to resolve
it one by one.  Here I have some questions as embedded.

On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <matz@suse.de> wrote:
>>
>>>
>>> I hope this is of some help to you :)
>> Thanks again, it's very helpful.
>>
>> I also fixed several bugs of previous implementation, mostly about debug info
>> statements and simple reductions.  As for test, I enabled this pass by default,
>> bootstrap and regtest GCC, I also build/run specs.  There must be some other
>> latent bugs in it, but guess we have to exercise it by enabling it at
>> some point.
>>
>> So any comments?
>
>  bool
> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool insert_dbg)
>  {
>
> that you need this suggests you do stmt removal in wrong order (you need to
> do reverse dom order).
>
> +/* Maximum number of statements in loop nest for loop interchange.  */
> +
> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
> +         "loop-interchange-max-num-stmts",
> +         "The maximum number of stmts in loop nest for loop interchange.",
> +         64, 0, 0)
>
> is that to limit dependence computation?  In this case you should probably
> limit the number of data references instead?
No, this is to limit number of statements in loop.  We don't want to
do interchange for too large loops, right?

>
> +ftree-loop-interchange
> +Common Report Var(flag_tree_loop_interchange) Optimization
> +Enable loop interchange on trees.
> +
>
> please re-use -floop-interchange instead and change the GRAPHITE tests
> to use -floop-nest-optimize.  You can do that as pre-approved thing now.
>
> Please enable the pass by default at O3 via opts.c.
There are quite many (vectorize) test cases affected by interchange
(which is correct I believe), so I will prepare another patch enabling
it at O3 and adjusting the tests, to keep this patch small.

>
> diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc
>
> gimple-loop-interchange.cc please.
>
> new file mode 100644
> index 0000000..abffbf6
> --- /dev/null
> +++ b/gcc/tree-ssa-loop-interchange.cc
> @@ -0,0 +1,2274 @@
> +/* Loop invariant motion.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
>
> Loop invariant motion? ... ;)
>
> Please add a "Contributed by ..." to have an easy way to figure people to blame.
>
> +}*induction_p;
> +
>
> space after '*'
>
> +}*reduction_p;
> +
>
> likewise.
>
> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
> +   ssa var appearing in any abnormal phi node.  */
> +
> +static inline bool
> +unsupported_phi_node (gphi *phi)
> +{
> +  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
> +    return true;
> +
> +  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> +    {
> +      tree arg = PHI_ARG_DEF (phi, i);
> +      if (TREE_CODE (arg) == SSA_NAME
> +         && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
> +       return true;
> +    }
> +
> +  return false;
>
> I believe the above isn't necessary given you rule out abnormal edges
> into the loop.
> Did you have a testcase that broke?  A minor thing I guess if it is
> just for extra
> safety...
>
> +/* Return true if all stmts in BB can be supported by loop interchange,
> +   otherwise return false.  ILOOP is not NULL if this loop_cand is the
> +   outer loop in loop nest.  */
> +
> +bool
> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
> +{
>
> docs and return value suggest this be named supported_operation
>
> +      /* Or it's invariant memory reference and only used by inner loop.  */
> +      if (gimple_assign_single_p (stmt)
> +         && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> +         && TREE_CODE (lhs) == SSA_NAME
> +         && single_use_in_loop (lhs, iloop->loop))
> +       continue;
>
> comment suggests multiple uses in loop would be ok?
>
> +      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
> +         || lhs != re->init)
> +       return;
> +
> +      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
> +         || !REFERENCE_CLASS_P (rhs))
> +       return;
>
> lhs and rhs are never NULL.  Please initialize them outside of the if.
> You want to disallow DECL_P rhs here?
>
> Can you add an overall function comment what this function does?  It seems
> to detect a reduction as produced by loop store-motion?  Thus it tries to
> get at enough information to perform the reverse transform?
>
> During review I have a hard time distinguishing between locals and members
> so can you please prefix all member variables with m_ as according to our
> code guidelines?  I guess what adds to the confusion is the loop_cand argument
> that sometimes is present for loop_cand member functions...
> (I personally prefer to prefix all member accesses with this-> but that's harder
> to enforce)
>
> +static void
> +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple *consumer)
> +{
>
> this extracts stmts starting from consumer but rules out other
> consumers (recursively).
> Is that intended?  I wonder if you can avoid this dance...
>
> +      /* Transform simple reduction of below form:
> +
> +          init = 0;
> +          loop:
> +            var = phi<init, next>
> +            next = var op ...
> +          reduc_sum = phi<next>
> +          MEM_REF[...] = reduc_sum
> +
> +        into:
> +
>
> which one is 'consumer'?  I wonder if you can simply leave all the
> dead code in place
> just emitting the load-update-store stmts and removing the
> MEM_REF[...] = reduc_sum
> above?
>
> +/* Eliminate dead code after loop interchange.  */
> +
> +void
> +loop_cand::eliminate_dead_code (void)
> +{
>
> PRE tracks "possible" dead defs and has a worklist algorithm in
> remove_dead_inserted_code.
> I wonder if you can do sth similar?  That is, I wonder if doing a
> sweep from last to first
> stmt wouldn't be better here?
>
> +         /* Given copy propagation is done during interchange, we can
> +            simply check zero uses of var and eliminate it.  */
> +         if (is_gimple_assign (stmt)
> +             && !gimple_vuse (stmt)
>
> you probably meant to check gimple_vdef?
>
> +             && !gimple_has_volatile_ops (stmt)
> +             && !gimple_has_side_effects (stmt)
>
> the former is redundant
>
> +             && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> +             && TREE_CODE (lhs) == SSA_NAME
> +             && has_zero_uses (lhs))
>
> if you use gimple_get_lhs () you can also handle calls.
>
> That said, this seems to be a very poor DCE, why is it necessary at all?
It is a very poor DCE, but enough to catch cases generated by
interchange operation.  Because there is no DCE pass between
interchange and vectorizer, I don't want to pass them to vectorizer.
It's expensive, and maybe sometimes bad code generation in vectorizer.

>
> +/* Interchange niters info of ILOOP and OLOOP while reset any other niters
> +   estimates information for now.  */
> +
> +static inline void
> +interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
> +{
> +  tree nb_iterations = oloop->nb_iterations;
> +
> +  oloop->any_upper_bound = false;
> +  oloop->any_likely_upper_bound = false;
> +  free_numbers_of_iterations_estimates (oloop);
> +
> +  oloop->nb_iterations = iloop->nb_iterations;
> +
> +  iloop->any_upper_bound = false;
> +  iloop->any_likely_upper_bound = false;
> +  free_numbers_of_iterations_estimates (iloop);
> +
> +  iloop->nb_iterations = nb_iterations;
>
> use std::swap?  Also I think if you can keep nb_iterations you
> can certainly keep the upper bounds.  You're probably
> afraid of the ->stmt references in the nb_iter_bound entries?
>
> Anyway, either scrap everything or try to keep everything.
>
> +  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
> +    {
> +      if (re->type != DOUBLE_RTYPE)
> +       gcc_unreachable ();
> +
> +      use_operand_p use_p;
> +      imm_use_iterator iterator;
> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
> +      if (TREE_CODE (re->init) == SSA_NAME)
> +       {
> +         FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
> +           mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
> +       }
>
> can you add a comment what you are doing here?
>
> Note that other loop opts simply scrap all debug stmts ...
>
> +static void
> +compute_access_stride (struct loop *loop_nest, struct loop *loop,
> +                      data_reference_p dr)
> +{
> ...
> +  tree ref = DR_REF (dr);
> +  tree scev_base = build_fold_addr_expr (ref);
> +  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
> +  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
> +  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
> +
> +  do {
> +    tree scev_fn = analyze_scalar_evolution (loop, scev_base);
> +    if (chrec_contains_undetermined (scev_fn)
> +       || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
> +      break;
> ...
> +    strides->safe_push (scev_step);
> +  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
> +
>
> I _think_ you want to do
>
>    scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
> DR_STMT (dr) is in loop
>    scev_fn = instantiate_parameters (nest, scev_fn);
>    if (chrec_contains_undetermined (scev_fn))
>      return; // false?
>
> and analyze the result which should be of the form
>
>   { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
>
> if canonical.  I think estimate_val_by_simplify_replace isn't needed
> if you do that
> (it also looks odd to replace all vairables in step by niter...).
No, estimate_val_by_simplify_replace is for cases like:

foo (int *p, int n)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
        p[k + j * n + i * n * n] += 1;
}

So the step1/step2 are linear functions of n, which are not constant.
The loop here can be seen as inlined instantiate_parameters.  Calling
it directly can't compute step1/step2 neither.
I have to admit SCEV still doesn't work very well with pointer
addresses.  Even worse, for above case, call to instantiate_parameters
gives:
{{p_21(D) + (sizetype) ((long unsigned int) ((int) ((unsigned int)
n_18(D) + 4294967295) * n_18(D)) * 4), +, (long unsigned int) (n_18(D)
* n_18(D)) * 4}_1, +, 4}_3

This isn't canonical at all, there might be something wrong in SCEV.

So if you don't mind, I tend to keep the code, computing scev for base
iteratively for each loop in nest.
Function estimate_val_by_simplify_replace does look odd, we can get
rid of it by giving up on all pointer cases.  Any preference?

>
> I think keeping the chrec in the above form is also more suitable for what
> the caller does so the outermost loop is simply
>
>   loop = loop_nest;
>   loop-over-all-dr-chrecs
>     if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
>       loop = CHREC_LOOP (chrec);
>
> given the outermost loop is the outer evolution.  If you sort the
> stride vecs from inner
> to outer you don't need prune_access_strides_not_in_loop.
Problem is loop_nest created by data dependence analyzer is sorted
from outer to inner.  Here I sort strides too so that we don't have
trouble in should_interchange_loops and update_data_refs.

Thanks,
bin
>
> +/* Count and return the number of loops in LOOP_NEST.  */
> +
> +unsigned int
> +num_loops_in_loop_nest (struct loop *loop_nest)
> +{
> +  unsigned num_loops;
> +  for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner)
> +    ;
> +  return num_loops;
>
> loop_depth (innermost) - loop_depth (nest)?
>
> +static bool
> +should_interchange_loops (unsigned i_idx, unsigned o_idx,
> +                         vec<data_reference_p> datarefs,
> +                         bool innermost_loops_p, bool dump_info_p = true)
> +{
>
> isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs
> and figure a permutation sequence to arrive there?  That is for the local
> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx]
> when we should interchange?
>
> That subloop_stride_p and tracking invariant DRs looks a bit odd.  For loops
> where a DR is invariant you simply do not have an evolution in that loop.
> You seem to simply add strides in the inner and outer loops for each DR,
> can you explain how that works?  Also I guess strides bigger than the
> various cache-line size parameters should be treated 'equal'?  That is,
> if we don't get any DR to a step that results in L1 cache hits because
> two DRs share a cache line the interchange is pointless?
>
> +/* Prune DATAREFS by removing any data reference not inside of LOOP.  */
> +
> +static inline void
> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
> +{
> +  struct data_reference *dr;
> +
> +  for (unsigned i = 0; datarefs.iterate (i, &dr);)
> +    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
> +      i++;
> +    else
> +      {
> +       datarefs.ordered_remove (i);
>
> that's expensive.  It's better to keep moving DRs we want to keep
> when walking the array.  That is, add a j you increment only when
> we keep a DR, moving *i to *j.
>
> +       if (dr->aux)
> +         {
> +           DR_ACCESS_STRIDE (dr)->release ();
> +           free (dr->aux);
> +         }
> +       free_data_ref (dr);
> +      }
> +}
>
> +
> +  start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
> +
>
> Hmm.  If you instantiate the SCEV for the niters for each loop in the nest
> wrt the nest you can figure if it has any evolution in sth else than the
> loop (evolution_function_is_univariate_p).   That is, this is not a problem
> until you arrive at analyzing DR strides, right?  At which point you
> can check for the appropriate form?
>
> +  if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
> +      /* Check if there is no data reference.  */
> +      || datarefs->length () == 0
> +      /* Check if there are too many of data references.  */
> +      || ((int) datarefs->length ()
> +         > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
> +      /* Check if there is any data reference in loop latch.  We can't handle
> +        loops which loop header and data references have different execution
> +        times.  */
> +      || dataref_niters_diff_to_loop_header (*datarefs)
>
> this suggests to do your own find_data_references_in_loop so you can early
> out.
>
> Overall the flow through the pass is a bit hard to follow given there are
> IMHO too many functions.
>
> +unsigned int
> +pass_linterchange::execute (function *fun)
> +{
> +  if (number_of_loops (fun) <= 2)
> +    return 0;
> +
> +  bool changed_p = false;;
> +  struct loop *loop;
> +  vec<loop_p> loop_nest;
> +  vec<data_reference_p> datarefs;
> +  vec<ddr_p> ddrs;
> +
> +  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
> +    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
> +      {
> +       tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
> +       changed_p |= loop_interchange.interchange ();
> +      }
>
> you leak datarefs/ddrs?
>
> Richard.
>
>
>> Thanks,
>> bin
>>
>> 2017-11-15  Bin Cheng  <bin.cheng@arm.com>
>>
>>     * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>>     * common.opt (ftree-loop-interchange): New option.
>>     * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>>     * gimple-iterator.c (gsi_remove): New parameter determining if dbg
>>     bind stmt is inserted or not.
>>     * gimple-iterator.h (gsi_remove): New parameter in declaration.
>>     * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
>>     (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
>>     * passes.def (pass_linterchange): New pass.
>>     * timevar.def (TV_LINTERCHANGE): New time var.
>>     * tree-pass.h (make_pass_linterchange): New declaration.
>>     * tree-ssa-loop-interchange.cc: New file.
>>     * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>>     Record IV before/after increment in new parameters.
>>     * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>
>> gcc/testsuite
>> 2017-11-15  Bin Cheng  <bin.cheng@arm.com>
>>
>>     * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
>>
>>>
>>>
>>> Ciao,
>>> Michael.

     prev parent reply	other threads:[~2017-11-23 16:29 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-30 14:31 Bin Cheng
2017-08-30 15:11 ` Richard Biener
2017-08-30 17:03   ` Bin.Cheng
2017-09-04 13:55     ` Richard Biener
2017-09-04 14:48       ` Bin.Cheng
2017-09-22 10:25       ` Bin.Cheng
2017-10-24 14:35         ` Michael Matz
2017-11-16 15:30           ` Bin.Cheng
2017-11-20 14:59             ` Richard Biener
2017-11-23 17:01               ` Bin.Cheng [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHFci2_7NW8eAkCEqPYYXULG3j1ROYbnroHgD5TaQxSZE+R9yg@mail.gmail.com \
    --to=amker.cheng@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=matz@suse.de \
    --cc=richard.guenther@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).