From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 33009 invoked by alias); 23 Nov 2017 16:29:54 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 32995 invoked by uid 89); 23 Nov 2017 16:29:53 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-25.4 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,KB_WAM_FROM_NAME_SINGLEWORD,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=common.opt, commonopt, UD:common.opt, vectorize X-HELO: mail-io0-f181.google.com Received: from mail-io0-f181.google.com (HELO mail-io0-f181.google.com) (209.85.223.181) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 23 Nov 2017 16:29:50 +0000 Received: by mail-io0-f181.google.com with SMTP id w127so27093974iow.11 for ; Thu, 23 Nov 2017 08:29:49 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=mdogTfWHKI0nwX4ehrWLsKi1pFS5cORAQFYM/iVvK64=; b=LqnVjNgR88eFGsEwE8W5zFdJ/HesQejISCwuNQSQEFNryy0qaiqKecGQzFw8FYcH0Z Ccd3AJngTFOT+UcsVtRun0U4WuNOEYHJwLg6tOBepqp+9hb2PnxLqp5uH9rVLTWOcSGx /6mrva+iSGf+Wf7TI5yoMzKY2GZ8p4Rn9wa0B32C3j7UwCzn+Z4lI3csvQerNUuRRL9t WiN44d4GZRfeW+WOkhCdj5lvhOosz4lh3+2WE0WeT+CFxMK3YavJ/pqEpPiRFqfWQk/f z8cm6AFLdMBilTN0r73E2frwKxo0zabB/6kA9nKLYRkBLWY+O7t25u4G9+ihsUXrWy+Y 6zbA== X-Gm-Message-State: AJaThX4G722tyav8adFVjT+1EMNzyO2cF4s7gZEBHo2gveMIrb1r6E3U ays+bMxKuk3YQEd+Br5TYryEI0EueJybGi+raJ1OsA== X-Google-Smtp-Source: AGs4zMYzts+kyCjvIKef9Tt30GozYfpBtXVYQtdB9ByLoIlkCfxuTEJEIc1DRmnUJ5urOpgUtuGnhFIY8MQWh2eOCgk= X-Received: by 10.107.186.139 with SMTP id k133mr28234204iof.121.1511454588125; Thu, 23 Nov 2017 08:29:48 -0800 (PST) MIME-Version: 1.0 Received: by 10.2.153.51 with HTTP; Thu, 23 Nov 2017 08:29:47 -0800 (PST) In-Reply-To: References: From: "Bin.Cheng" Date: Thu, 23 Nov 2017 17:01:00 -0000 Message-ID: Subject: Re: [PATCH GCC]A simple implementation of loop interchange To: Richard Biener Cc: Michael Matz , "gcc-patches@gcc.gnu.org" Content-Type: text/plain; charset="UTF-8" X-IsSubscribed: yes X-SW-Source: 2017-11/txt/msg02143.txt.bz2 Hi Richard, Thanks for reviewing. It's quite lot comment, I am trying to resolve it one by one. Here I have some questions as embedded. On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener wrote: > On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng wrote: >> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz wrote: >> >>> >>> I hope this is of some help to you :) >> Thanks again, it's very helpful. >> >> I also fixed several bugs of previous implementation, mostly about debug info >> statements and simple reductions. As for test, I enabled this pass by default, >> bootstrap and regtest GCC, I also build/run specs. There must be some other >> latent bugs in it, but guess we have to exercise it by enabling it at >> some point. >> >> So any comments? > > bool > -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently) > +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool insert_dbg) > { > > that you need this suggests you do stmt removal in wrong order (you need to > do reverse dom order). > > +/* Maximum number of statements in loop nest for loop interchange. */ > + > +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS, > + "loop-interchange-max-num-stmts", > + "The maximum number of stmts in loop nest for loop interchange.", > + 64, 0, 0) > > is that to limit dependence computation? In this case you should probably > limit the number of data references instead? No, this is to limit number of statements in loop. We don't want to do interchange for too large loops, right? > > +ftree-loop-interchange > +Common Report Var(flag_tree_loop_interchange) Optimization > +Enable loop interchange on trees. > + > > please re-use -floop-interchange instead and change the GRAPHITE tests > to use -floop-nest-optimize. You can do that as pre-approved thing now. > > Please enable the pass by default at O3 via opts.c. There are quite many (vectorize) test cases affected by interchange (which is correct I believe), so I will prepare another patch enabling it at O3 and adjusting the tests, to keep this patch small. > > diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc > > gimple-loop-interchange.cc please. > > new file mode 100644 > index 0000000..abffbf6 > --- /dev/null > +++ b/gcc/tree-ssa-loop-interchange.cc > @@ -0,0 +1,2274 @@ > +/* Loop invariant motion. > + Copyright (C) 2017 Free Software Foundation, Inc. > > Loop invariant motion? ... ;) > > Please add a "Contributed by ..." to have an easy way to figure people to blame. > > +}*induction_p; > + > > space after '*' > > +}*reduction_p; > + > > likewise. > > +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains > + ssa var appearing in any abnormal phi node. */ > + > +static inline bool > +unsupported_phi_node (gphi *phi) > +{ > + if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi))) > + return true; > + > + for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i) > + { > + tree arg = PHI_ARG_DEF (phi, i); > + if (TREE_CODE (arg) == SSA_NAME > + && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg)) > + return true; > + } > + > + return false; > > I believe the above isn't necessary given you rule out abnormal edges > into the loop. > Did you have a testcase that broke? A minor thing I guess if it is > just for extra > safety... > > +/* Return true if all stmts in BB can be supported by loop interchange, > + otherwise return false. ILOOP is not NULL if this loop_cand is the > + outer loop in loop nest. */ > + > +bool > +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop) > +{ > > docs and return value suggest this be named supported_operation > > + /* Or it's invariant memory reference and only used by inner loop. */ > + if (gimple_assign_single_p (stmt) > + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE > + && TREE_CODE (lhs) == SSA_NAME > + && single_use_in_loop (lhs, iloop->loop)) > + continue; > > comment suggests multiple uses in loop would be ok? > > + if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE > + || lhs != re->init) > + return; > + > + if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE > + || !REFERENCE_CLASS_P (rhs)) > + return; > > lhs and rhs are never NULL. Please initialize them outside of the if. > You want to disallow DECL_P rhs here? > > Can you add an overall function comment what this function does? It seems > to detect a reduction as produced by loop store-motion? Thus it tries to > get at enough information to perform the reverse transform? > > During review I have a hard time distinguishing between locals and members > so can you please prefix all member variables with m_ as according to our > code guidelines? I guess what adds to the confusion is the loop_cand argument > that sometimes is present for loop_cand member functions... > (I personally prefer to prefix all member accesses with this-> but that's harder > to enforce) > > +static void > +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple *consumer) > +{ > > this extracts stmts starting from consumer but rules out other > consumers (recursively). > Is that intended? I wonder if you can avoid this dance... > > + /* Transform simple reduction of below form: > + > + init = 0; > + loop: > + var = phi > + next = var op ... > + reduc_sum = phi > + MEM_REF[...] = reduc_sum > + > + into: > + > > which one is 'consumer'? I wonder if you can simply leave all the > dead code in place > just emitting the load-update-store stmts and removing the > MEM_REF[...] = reduc_sum > above? > > +/* Eliminate dead code after loop interchange. */ > + > +void > +loop_cand::eliminate_dead_code (void) > +{ > > PRE tracks "possible" dead defs and has a worklist algorithm in > remove_dead_inserted_code. > I wonder if you can do sth similar? That is, I wonder if doing a > sweep from last to first > stmt wouldn't be better here? > > + /* Given copy propagation is done during interchange, we can > + simply check zero uses of var and eliminate it. */ > + if (is_gimple_assign (stmt) > + && !gimple_vuse (stmt) > > you probably meant to check gimple_vdef? > > + && !gimple_has_volatile_ops (stmt) > + && !gimple_has_side_effects (stmt) > > the former is redundant > > + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE > + && TREE_CODE (lhs) == SSA_NAME > + && has_zero_uses (lhs)) > > if you use gimple_get_lhs () you can also handle calls. > > That said, this seems to be a very poor DCE, why is it necessary at all? It is a very poor DCE, but enough to catch cases generated by interchange operation. Because there is no DCE pass between interchange and vectorizer, I don't want to pass them to vectorizer. It's expensive, and maybe sometimes bad code generation in vectorizer. > > +/* Interchange niters info of ILOOP and OLOOP while reset any other niters > + estimates information for now. */ > + > +static inline void > +interchange_nb_iterations (struct loop *iloop, struct loop *oloop) > +{ > + tree nb_iterations = oloop->nb_iterations; > + > + oloop->any_upper_bound = false; > + oloop->any_likely_upper_bound = false; > + free_numbers_of_iterations_estimates (oloop); > + > + oloop->nb_iterations = iloop->nb_iterations; > + > + iloop->any_upper_bound = false; > + iloop->any_likely_upper_bound = false; > + free_numbers_of_iterations_estimates (iloop); > + > + iloop->nb_iterations = nb_iterations; > > use std::swap? Also I think if you can keep nb_iterations you > can certainly keep the upper bounds. You're probably > afraid of the ->stmt references in the nb_iter_bound entries? > > Anyway, either scrap everything or try to keep everything. > > + for (i = 0; oloop.reductions.iterate (i, &re); ++i) > + { > + if (re->type != DOUBLE_RTYPE) > + gcc_unreachable (); > + > + use_operand_p use_p; > + imm_use_iterator iterator; > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var) > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var); > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next) > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next); > + if (TREE_CODE (re->init) == SSA_NAME) > + { > + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init) > + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init); > + } > > can you add a comment what you are doing here? > > Note that other loop opts simply scrap all debug stmts ... > > +static void > +compute_access_stride (struct loop *loop_nest, struct loop *loop, > + data_reference_p dr) > +{ > ... > + tree ref = DR_REF (dr); > + tree scev_base = build_fold_addr_expr (ref); > + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref)); > + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER); > + access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size); > + > + do { > + tree scev_fn = analyze_scalar_evolution (loop, scev_base); > + if (chrec_contains_undetermined (scev_fn) > + || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num)) > + break; > ... > + strides->safe_push (scev_step); > + } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL); > + > > I _think_ you want to do > > scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming > DR_STMT (dr) is in loop > scev_fn = instantiate_parameters (nest, scev_fn); > if (chrec_contains_undetermined (scev_fn)) > return; // false? > > and analyze the result which should be of the form > > { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ... > > if canonical. I think estimate_val_by_simplify_replace isn't needed > if you do that > (it also looks odd to replace all vairables in step by niter...). No, estimate_val_by_simplify_replace is for cases like: foo (int *p, int n) { for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) for (int k = 0; k < n; k++) p[k + j * n + i * n * n] += 1; } So the step1/step2 are linear functions of n, which are not constant. The loop here can be seen as inlined instantiate_parameters. Calling it directly can't compute step1/step2 neither. I have to admit SCEV still doesn't work very well with pointer addresses. Even worse, for above case, call to instantiate_parameters gives: {{p_21(D) + (sizetype) ((long unsigned int) ((int) ((unsigned int) n_18(D) + 4294967295) * n_18(D)) * 4), +, (long unsigned int) (n_18(D) * n_18(D)) * 4}_1, +, 4}_3 This isn't canonical at all, there might be something wrong in SCEV. So if you don't mind, I tend to keep the code, computing scev for base iteratively for each loop in nest. Function estimate_val_by_simplify_replace does look odd, we can get rid of it by giving up on all pointer cases. Any preference? > > I think keeping the chrec in the above form is also more suitable for what > the caller does so the outermost loop is simply > > loop = loop_nest; > loop-over-all-dr-chrecs > if (flow_loop_nested_p (loop, CHREC_LOOP (chrec))) > loop = CHREC_LOOP (chrec); > > given the outermost loop is the outer evolution. If you sort the > stride vecs from inner > to outer you don't need prune_access_strides_not_in_loop. Problem is loop_nest created by data dependence analyzer is sorted from outer to inner. Here I sort strides too so that we don't have trouble in should_interchange_loops and update_data_refs. Thanks, bin > > +/* Count and return the number of loops in LOOP_NEST. */ > + > +unsigned int > +num_loops_in_loop_nest (struct loop *loop_nest) > +{ > + unsigned num_loops; > + for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner) > + ; > + return num_loops; > > loop_depth (innermost) - loop_depth (nest)? > > +static bool > +should_interchange_loops (unsigned i_idx, unsigned o_idx, > + vec datarefs, > + bool innermost_loops_p, bool dump_info_p = true) > +{ > > isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs > and figure a permutation sequence to arrive there? That is for the local > decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx] > when we should interchange? > > That subloop_stride_p and tracking invariant DRs looks a bit odd. For loops > where a DR is invariant you simply do not have an evolution in that loop. > You seem to simply add strides in the inner and outer loops for each DR, > can you explain how that works? Also I guess strides bigger than the > various cache-line size parameters should be treated 'equal'? That is, > if we don't get any DR to a step that results in L1 cache hits because > two DRs share a cache line the interchange is pointless? > > +/* Prune DATAREFS by removing any data reference not inside of LOOP. */ > + > +static inline void > +prune_datarefs_not_in_loop (struct loop *loop, vec datarefs) > +{ > + struct data_reference *dr; > + > + for (unsigned i = 0; datarefs.iterate (i, &dr);) > + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr)))) > + i++; > + else > + { > + datarefs.ordered_remove (i); > > that's expensive. It's better to keep moving DRs we want to keep > when walking the array. That is, add a j you increment only when > we keep a DR, moving *i to *j. > > + if (dr->aux) > + { > + DR_ACCESS_STRIDE (dr)->release (); > + free (dr->aux); > + } > + free_data_ref (dr); > + } > +} > > + > + start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop); > + > > Hmm. If you instantiate the SCEV for the niters for each loop in the nest > wrt the nest you can figure if it has any evolution in sth else than the > loop (evolution_function_is_univariate_p). That is, this is not a problem > until you arrive at analyzing DR strides, right? At which point you > can check for the appropriate form? > > + if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know > + /* Check if there is no data reference. */ > + || datarefs->length () == 0 > + /* Check if there are too many of data references. */ > + || ((int) datarefs->length () > + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS)) > + /* Check if there is any data reference in loop latch. We can't handle > + loops which loop header and data references have different execution > + times. */ > + || dataref_niters_diff_to_loop_header (*datarefs) > > this suggests to do your own find_data_references_in_loop so you can early > out. > > Overall the flow through the pass is a bit hard to follow given there are > IMHO too many functions. > > +unsigned int > +pass_linterchange::execute (function *fun) > +{ > + if (number_of_loops (fun) <= 2) > + return 0; > + > + bool changed_p = false;; > + struct loop *loop; > + vec loop_nest; > + vec datarefs; > + vec ddrs; > + > + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST) > + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs)) > + { > + tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs); > + changed_p |= loop_interchange.interchange (); > + } > > you leak datarefs/ddrs? > > Richard. > > >> Thanks, >> bin >> >> 2017-11-15 Bin Cheng >> >> * Makefile.in (tree-ssa-loop-interchange.o): New object file. >> * common.opt (ftree-loop-interchange): New option. >> * doc/invoke.texi (-ftree-loop-interchange): Document new option. >> * gimple-iterator.c (gsi_remove): New parameter determining if dbg >> bind stmt is inserted or not. >> * gimple-iterator.h (gsi_remove): New parameter in declaration. >> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter. >> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter. >> * passes.def (pass_linterchange): New pass. >> * timevar.def (TV_LINTERCHANGE): New time var. >> * tree-pass.h (make_pass_linterchange): New declaration. >> * tree-ssa-loop-interchange.cc: New file. >> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external. >> Record IV before/after increment in new parameters. >> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration. >> >> gcc/testsuite >> 2017-11-15 Bin Cheng >> >> * gcc.dg/tree-ssa/loop-interchange-1.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-2.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-3.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-4.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-5.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-6.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-7.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-8.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-9.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-10.c: New test. >> * gcc.dg/tree-ssa/loop-interchange-11.c: New test. >> >>> >>> >>> Ciao, >>> Michael.