From: Richard Biener <richard.guenther@gmail.com>
To: Bin Cheng <Bin.Cheng@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>, "matz@suse.de" <matz@suse.de>
Subject: Re: [PATCH GCC][V2]A simple implementation of loop interchange
Date: Thu, 30 Nov 2017 13:19:00 -0000 [thread overview]
Message-ID: <CAFiYyc12RH3VLjs3vxjcSwWA5CKXWOzbU2Bf5eBh5dzEEaeTPg@mail.gmail.com> (raw)
In-Reply-To: <HE1PR0801MB2746BB48EAA2B7810B08525BE73A0@HE1PR0801MB2746.eurprd08.prod.outlook.com>
On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> This is updated patch with review comments resolved. Some explanation embedded below.
>
> On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guenther@gmail.com> wrote:
>> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <matz@suse.de> wrote:
>>>> Hello,
>>>>
>>>> On Fri, 22 Sep 2017, Bin.Cheng wrote:
>>>>
>>>>> This is updated patch for loop interchange with review suggestions
>>>>> resolved. Changes are:
>>>>> 1) It does more light weight checks like rectangle loop nest check
>>>>> earlier than before.
>>>>> 2) It checks profitability of interchange before data dependence computation.
>>>>> 3) It calls find_data_references_in_loop only once for a loop nest now.
>>>>> 4) Data dependence is open-computed so that we can skip instantly at
>>>>> unknown dependence.
>>>>> 5) It improves code generation in mapping induction variables for
>>>>> loop nest, as well as
>>>>> adding a simple dead code elimination pass.
>>>>> 6) It changes magic constants into parameters.
>>>>
>>>> So I have a couple comments/questions. Something stylistic:
>>> Hi Michael,
>>> Thanks for reviewing.
>>>
>>>>
>>>>> +class loop_cand
>>>>> +{
>>>>> +public:
>>>>> ...
>>>>> + friend class tree_loop_interchange;
>>>>> +private:
>>>>
>>>> Just make this all public (and hence a struct, not class).
>>>> No need for friends in file local classes.
>>> Done.
>>>
>>>>
>>>>> +single_use_in_loop (tree var, struct loop *loop)
>>>>> ...
>>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>>>>> + {
>>>>> + stmt = USE_STMT (use_p);
>>>>> ...
>>>>> + basic_block bb = gimple_bb (stmt);
>>>>> + gcc_assert (bb != NULL);
>>>>
>>>> This pattern reoccurs often in your patch: you check for a bb associated
>>>> for a USE_STMT. Uses of SSA names always occur in basic blocks, no need
>>>> for checking.
>>> Done.
>>>
>>>>
>>>> Then, something about your handling of simple reductions:
>>>>
>>>>> +void
>>>>> +loop_cand::classify_simple_reduction (reduction_p re)
>>>>> +{
>>>>> ...
>>>>> + /* Require memory references in producer and consumer are the same so
>>>>> + that we can undo reduction during interchange. */
>>>>> + if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
>>>>> + return;
>>>>
>>>> Where is it checked that the undoing transformation is legal also
>>>> from a data dep point of view? Think code like this:
>>>>
>>>> sum = X[i];
>>>> for (j ...)
>>>> sum += X[j];
>>>> X[i] = sum;
>>>>
>>>> Moving the store into the inner loop isn't always correct and I don't seem
>>>> to find where the above situation is rejected.
>>> Yeah. for the old patch, it's possible to have such loop wrongly interchanged;
>>> in practice, it's hard to create an example. The pass will give up
>>> when computing
>>> data dep between references in inner/outer loops. In this updated
>>> patch, it's fixed
>>> by giving up if there is any dependence between references of inner/outer loops.
>>>
>>>>
>>>> Maybe I'm confused because I also don't see where you even can get into
>>>> the above situation (though I do see testcases about this). The thing is,
>>>> for an 2d loop nest to contain something like the above reduction it can't
>>>> be perfect:
>>>>
>>>> for (j) {
>>>> int sum = X[j]; // 1
>>>> for (i)
>>>> sum += Y[j][i];
>>>> X[j] = sum; // 2
>>>> }
>>>>
>>>> But you do check for perfectness in proper_loop_form_for_interchange and
>>>> prepare_perfect_loop_nest, so either you can't get into the situation or
>>>> the checking can't be complete, or you define the above to be perfect
>>>> nevertheless (probably because the load and store are in outer loop
>>>> header/exit blocks?). The latter would mean that you accept also other
>>>> code in header/footer of loops from a pure CFG perspective, so where is it
>>>> checked that that other code (which aren't simple reductions) isn't
>>>> harmful to the transformation?
>>> Yes, I used the name perfect loop nest, but the pass can handle special form
>>> imperfect loop nest for the simple reduction. I added comments describing
>>> this before function prepare_perfect_loop_nest.
>>>
>>>>
>>>> Then, the data dependence part of the new pass:
>>>>
>>>>> +bool
>>>>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned outer)
>>>>> +{
>>>>> + struct data_dependence_relation *ddr;
>>>>> +
>>>>> + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>>>>> + {
>>>>> + /* Skip no-dependence case. */
>>>>> + if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>>>> + continue;
>>>>> +
>>>>> + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>>>> + {
>>>>> + lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>>>>> + unsigned level = dependence_level (dist_vect, loop_nest.length ());
>>>>> +
>>>>> + /* If there is no carried dependence. */
>>>>> + if (level == 0)
>>>>> + continue;
>>>>> +
>>>>> + level --;
>>>>> + /* Skip case which has '>' as the leftmost direction. */
>>>>> + if (!lambda_vector_lexico_pos (dist_vect, level))
>>>>> + return false;
>>>>
>>>> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.
>>> Done.
>>>
>>>>
>>>>> + /* If dependence is carried by outer loop of the two loops for
>>>>> + interchange. */
>>>>> + if (level < outer)
>>>>> + continue;
>>>>> +
>>>>> + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>>>>> + /* If directions at both inner/outer levels are the same. */
>>>>> + if (dir_vect[inner] == dir_vect[outer])
>>>>> + continue;
>>>>> +
>>>>> + /* Be conservative, skip case if either direction at inner/outer
>>>>> + levels is not '=' or '<'. */
>>>>> + if (dir_vect[inner] != dir_equal
>>>>> + && dir_vect[inner] != dir_positive
>>>>> + && dir_vect[inner] != dir_independent
>>>>> + && dir_vect[inner] != dir_positive_or_equal)
>>>>> + return false;
>>>>> +
>>>>> + if (dir_vect[outer] != dir_equal
>>>>> + && dir_vect[outer] != dir_positive
>>>>> + && dir_vect[outer] != dir_independent
>>>>> + && dir_vect[outer] != dir_positive_or_equal)
>>>>> + return false;
>>>>
>>>> Checking dir vectors doesn't make much sense in GCC: the elements are only
>>>> ever set to dir_positive, dir_negative or dir_equal, exactly when distance
>>>> is
>>>> > 0, < 0 or == 0. So checking dist vector is enough. (though sameness of
>>>> direction checks sameness of sign with zero). Incidentally:
>>> Done.
>>>
>>>>
>>>>> +tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer)
>>>>> +{
>>>>> + struct data_dependence_relation *ddr;
>>>>> +
>>>>> + for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>>>>> + {
>>>>> + /* Skip no-dependence case. */
>>>>> + if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>>>> + continue;
>>>>> +
>>>>> + for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>>>> + {
>>>>> + lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>>>>> + std::swap (dir_vect[inner], dir_vect[outer]);
>>>>> + }
>>>>> + }
>>>>> +}
>>>>
>>>> Here you swap only the direction but not the distance vector, which can't
>>>> be right. I suggest only using (and updating) the distance vector.
>>> Yeah, fixed.
>>>
>>>>
>>>> And then your usage and update of DR_ACCESS_FNs: there's quite some
>>>> complexity connected with that and I'm not sure how worthwhile it is.
>>>> You're basically using the ACCESS_FNs to determine profitability (and not
>>>> for validity, and that's good). But e.g. for pointer based accesses like
>>>> in fortran with explicit address arithmetic the relation between access-fn
>>>> step and stride and actual access stride isn't that easy (e.g. in your
>>>> should_interchange_loops function iloop_stride and oloop_stride will
>>>> always be one for pointer based accesses).
>>>>
>>>> Conceptually what you should check is how the access address for each data
>>>> ref revolves for each loop, so why not doing this explicitely? What I
>>>> mean is: calculate a (complicated) chrec for the DR addresses for the
>>>> whole nest at the beginning. It should be in the form like (assume "+"
>>>> always):
>>>>
>>>> {{{init, s1}_l1, s2}_l2, s3}_l3
>>>>
>>>> (i.e. all steps should be invariants/constants, and only one non-chrec
>>>> init value). Addresses which aren't in this form you're already ignoring
>>>> right now, so you could continue doing that. (Or better said, all
>>>> non-constant steps you regard as being AVG_DIM_SIZE, which you still can
>>>> continue doing).
>>>>
>>>> Now, with the above form you can form expressions for the difference
>>>> between addresses per iteration for each loop (i.e. the address stride per
>>>> loop); store these. Then, when interchanging loops you need to merely
>>>> swap these expressions like you have to with the distance vector, instead
>>>> of fiddling inside the DR_ACCESS_FNs themself. Much code would go away.
>>> Yeah. Did similar thing in loop nest distribution pass. See
>>> compute_access_range
>>> in tree-loop-distribution.c. Actually, I would do the same here if I
>>> had implemented
>>> this pass after loop nest distribution patches. Done in this updated patch.
>>>
>>>>
>>>> Testcases: given that we had to remove our old separate interchange pass
>>>> because it miscompiled stuff all over I'm missing some testcases where
>>>> interchange should _not_ happen for validity reasons, like my above
>>>> example with an reduction that can't be moved inside. Perhaps you can
>>>> think of some more.
>>> As mentioned above, it's hard to create test that fail exactly for this reason.
>>> I added one that data dependence prevents us from interchanging the loop.
>>>
>>>>
>>>> I hope this is of some help to you :)
>>> Thanks again, it's very helpful.
>>>
>>> I also fixed several bugs of previous implementation, mostly about debug info
>>> statements and simple reductions. As for test, I enabled this pass by default,
>>> bootstrap and regtest GCC, I also build/run specs. There must be some other
>>> latent bugs in it, but guess we have to exercise it by enabling it at
>>> some point.
>>>
>>> So any comments?
>>
>> bool
>> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
>> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool insert_dbg)
>> {
>>
>> that you need this suggests you do stmt removal in wrong order (you need to
>> do reverse dom order).
> As below code in handling debug uses, this updated patch gives up on more cases
> by scrapping debug uses now. Hopefully this isn't a problem, debugging experience
> for interchange loops is bad already?
>>
>> +/* Maximum number of statements in loop nest for loop interchange. */
>> +
>> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
>> + "loop-interchange-max-num-stmts",
>> + "The maximum number of stmts in loop nest for loop interchange.",
>> + 64, 0, 0)
>>
>> is that to limit dependence computation? In this case you should probably
>> limit the number of data references instead?
> Hmm, I kept this one and it is to limit the size of loops for interchange.
>
>>
>> +ftree-loop-interchange
>> +Common Report Var(flag_tree_loop_interchange) Optimization
>> +Enable loop interchange on trees.
>> +
>>
>> please re-use -floop-interchange instead and change the GRAPHITE tests
>> to use -floop-nest-optimize. You can do that as pre-approved thing now.
> Done. I will send an independent patch adjusting GRAPHITE tests.
>
>>
>> Please enable the pass by default at O3 via opts.c.
> I will do it in a separated patch because many vectorization tests are vulnerable
> to interchange. I checked these tests, interchange is good, we need to disable
> explicitly.
>
>>
>> diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc
>>
>> gimple-loop-interchange.cc please.
>>
>> new file mode 100644
>> index 0000000..abffbf6
>> --- /dev/null
>> +++ b/gcc/tree-ssa-loop-interchange.cc
>> @@ -0,0 +1,2274 @@
>> +/* Loop invariant motion.
>> + Copyright (C) 2017 Free Software Foundation, Inc.
>>
>> Loop invariant motion? ... ;)
>>
>> Please add a "Contributed by ..." to have an easy way to figure people to blame.
>>
>> +}*induction_p;
>> +
>>
>> space after '*'
>>
>> +}*reduction_p;
>> +
>>
>> likewise.
> All done.
>
>>
>> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
>> + ssa var appearing in any abnormal phi node. */
>> +
>> +static inline bool
>> +unsupported_phi_node (gphi *phi)
>> +{
>> + if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
>> + return true;
>> +
>> + for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
>> + {
>> + tree arg = PHI_ARG_DEF (phi, i);
>> + if (TREE_CODE (arg) == SSA_NAME
>> + && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
>> + return true;
>> + }
>> +
>> + return false;
>>
>> I believe the above isn't necessary given you rule out abnormal edges
>> into the loop.
>> Did you have a testcase that broke? A minor thing I guess if it is
>> just for extra
>> safety...
> Extra safety I guess. I now remove this and haven't run into compilation issues so far.
>
>>
>> +/* Return true if all stmts in BB can be supported by loop interchange,
>> + otherwise return false. ILOOP is not NULL if this loop_cand is the
>> + outer loop in loop nest. */
>> +
>> +bool
>> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
>> +{
>>
>> docs and return value suggest this be named supported_operation
> Done.
>
>>
>> + /* Or it's invariant memory reference and only used by inner loop. */
>> + if (gimple_assign_single_p (stmt)
>> + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
>> + && TREE_CODE (lhs) == SSA_NAME
>> + && single_use_in_loop (lhs, iloop->loop))
>> + continue;
>>
>> comment suggests multiple uses in loop would be ok?
> Comment changed.
>
>>
>> + if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
>> + || lhs != re->init)
>> + return;
>> +
>> + if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
>> + || !REFERENCE_CLASS_P (rhs))
>> + return;
>>
>> lhs and rhs are never NULL. Please initialize them outside of the if.
>> You want to disallow DECL_P rhs here?
> Done.
>
>>
>> Can you add an overall function comment what this function does? It seems
>> to detect a reduction as produced by loop store-motion? Thus it tries to
>> get at enough information to perform the reverse transform?
> Yes. Comment added.
>
>>
>> During review I have a hard time distinguishing between locals and members
>> so can you please prefix all member variables with m_ as according to our
>> code guidelines? I guess what adds to the confusion is the loop_cand argument
>> that sometimes is present for loop_cand member functions...
>> (I personally prefer to prefix all member accesses with this-> but that's harder
>> to enforce)
> Done by using m_* stuff.
>
>>
>> +static void
>> +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple *consumer)
>> +{
>>
>> this extracts stmts starting from consumer but rules out other
>> consumers (recursively).
>> Is that intended? I wonder if you can avoid this dance...
> Okay, so in case the reduction is initialized from constant value, we need to generate new
> load memory reference when undoing it. The new memory reference may depends on idx
> variables like in ARRAY_REF. This function tries to find all depended stmts for inserting it
> I didn't look into how GRAPHITE tracks the idx variable and regenerate it when necessary,
> maybe we can do the same in the future.
Ok.
>>
>> + /* Transform simple reduction of below form:
>> +
>> + init = 0;
>> + loop:
>> + var = phi<init, next>
>> + next = var op ...
>> + reduc_sum = phi<next>
>> + MEM_REF[...] = reduc_sum
>> +
>> + into:
>> +
>>
>> which one is 'consumer'? I wonder if you can simply leave all the
>> dead code in place
>> just emitting the load-update-store stmts and removing the
>> MEM_REF[...] = reduc_sum
>> above?
> Done. I simplified the code generation for the pass and uses prerequisite simple dce
> interface to remove dead code. Hope it's clearer now.
>
>>
>> +/* Eliminate dead code after loop interchange. */
>> +
>> +void
>> +loop_cand::eliminate_dead_code (void)
>> +{
>>
>> PRE tracks "possible" dead defs and has a worklist algorithm in
>> remove_dead_inserted_code.
>> I wonder if you can do sth similar? That is, I wonder if doing a
>> sweep from last to first
>> stmt wouldn't be better here?
>>
>> + /* Given copy propagation is done during interchange, we can
>> + simply check zero uses of var and eliminate it. */
>> + if (is_gimple_assign (stmt)
>> + && !gimple_vuse (stmt)
>>
>> you probably meant to check gimple_vdef?
>>
>> + && !gimple_has_volatile_ops (stmt)
>> + && !gimple_has_side_effects (stmt)
>>
>> the former is redundant
>>
>> + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
>> + && TREE_CODE (lhs) == SSA_NAME
>> + && has_zero_uses (lhs))
>>
>> if you use gimple_get_lhs () you can also handle calls.
>>
>> That said, this seems to be a very poor DCE, why is it necessary at all?
> Local DCE removed by reusing new interface simple_dce_from_worklist.
> But the DCE is necessary giving dead code lasts to vectorizer. At least we
> can save compilation time.
>
>>
>> +/* Interchange niters info of ILOOP and OLOOP while reset any other niters
>> + estimates information for now. */
>> +
>> +static inline void
>> +interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
>> +{
>> + tree nb_iterations = oloop->nb_iterations;
>> +
>> + oloop->any_upper_bound = false;
>> + oloop->any_likely_upper_bound = false;
>> + free_numbers_of_iterations_estimates (oloop);
>> +
>> + oloop->nb_iterations = iloop->nb_iterations;
>> +
>> + iloop->any_upper_bound = false;
>> + iloop->any_likely_upper_bound = false;
>> + free_numbers_of_iterations_estimates (iloop);
>> +
>> + iloop->nb_iterations = nb_iterations;
>>
>> use std::swap? Also I think if you can keep nb_iterations you
>> can certainly keep the upper bounds. You're probably
>> afraid of the ->stmt references in the nb_iter_bound entries?
>>
>> Anyway, either scrap everything or try to keep everything.
> Yeah, not only the stmts, but also the control_iv information because the SCEV
> information may be corrupted during code transformation.
> Now I discarded all the information.
Note that given you interchange the loops but not the CFG or the loop structures
you might want to swap loop->num and flags like ->force_vectorize. That is,
essentially change the ->header/latch association (and other CFG related stuff
like recorded exits).
It might also be we want to / need to disable interchange for, say,
->force_vectorize
inner loops or ->unroll != 0? Or we need to clear them, maybe
optionally diagnosing
that fact.
At least we need to think about what it means to preserve loop
structure (semantically,
loop->num should maintain association to the same source-level loop
throughout the
compilation) for transforms like interchange.
>>
>> + for (i = 0; oloop.reductions.iterate (i, &re); ++i)
>> + {
>> + if (re->type != DOUBLE_RTYPE)
>> + gcc_unreachable ();
>> +
>> + use_operand_p use_p;
>> + imm_use_iterator iterator;
>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
>> + if (TREE_CODE (re->init) == SSA_NAME)
>> + {
>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
>> + }
>>
>> can you add a comment what you are doing here?
>>
>> Note that other loop opts simply scrap all debug stmts ...
> As mentioned above, updated patch doesn't try hard to maintain debug use info any more.
>
>>
>> +static void
>> +compute_access_stride (struct loop *loop_nest, struct loop *loop,
>> + data_reference_p dr)
>> +{
>> ...
>> + tree ref = DR_REF (dr);
>> + tree scev_base = build_fold_addr_expr (ref);
>> + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
>> + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
>> + access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
>> +
>> + do {
>> + tree scev_fn = analyze_scalar_evolution (loop, scev_base);
>> + if (chrec_contains_undetermined (scev_fn)
>> + || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
>> + break;
>> ...
>> + strides->safe_push (scev_step);
>> + } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
>> +
>>
>> I _think_ you want to do
>>
>> scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
>> DR_STMT (dr) is in loop
>> scev_fn = instantiate_parameters (nest, scev_fn);
>> if (chrec_contains_undetermined (scev_fn))
>> return; // false?
>>
>> and analyze the result which should be of the form
>>
>> { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
>>
>> if canonical. I think estimate_val_by_simplify_replace isn't needed
>> if you do that
>> (it also looks odd to replace all vairables in step by niter...).
> I replied on this in previous message, instantiate_parameters doesn't always
> give canonical form result as expected. The loop here could be seen as a
> local instantiate process, right?
Kind of. I'll see if I can reproduce the difference with any of your
intercahnge
testcases - any hint which one to look at?
> Also estimate_val_by_simplify_replace is needed for pointers, where strides
> are computed from niters of loops which could be non compilation time constant.
> But yes, it's an odd fixup after I failed to do anything better.
But you are for example computing _1 - _2 to zero, right? Because both _1
and _2 are not constant and thus you replace it with the same (symbolical)
constant 'niter'.
I think that asks for garbage-in-garbage-out ...
Which testcase is this important for so I can have a look?
>>
>> I think keeping the chrec in the above form is also more suitable for what
>> the caller does so the outermost loop is simply
>>
>> loop = loop_nest;
>> loop-over-all-dr-chrecs
>> if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
>> loop = CHREC_LOOP (chrec);
>>
>> given the outermost loop is the outer evolution. If you sort the
>> stride vecs from inner
>> to outer you don't need prune_access_strides_not_in_loop.
> Hmm, So stripping outer loops prefer inner to outer sort of strides, but cost computation
> during interchange prefers outer to inner sort because loop_nest in tree-data-ref is sorted
> in this way. Seems a single prune_* function is better than fiddling with cost computation.
Not sure how to interpret your answer... I'll see to have a more
detailed suggestion
after playing with the code a bit.
>>
>> +/* Count and return the number of loops in LOOP_NEST. */
>> +
>> +unsigned int
>> +num_loops_in_loop_nest (struct loop *loop_nest)
>> +{
>> + unsigned num_loops;
>> + for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner)
>> + ;
>> + return num_loops;
>>
>> loop_depth (innermost) - loop_depth (nest)?
> Done.
>
>>
>> +static bool
>> +should_interchange_loops (unsigned i_idx, unsigned o_idx,
>> + vec<data_reference_p> datarefs,
>> + bool innermost_loops_p, bool dump_info_p = true)
>> +{
>>
>> isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs
>> and figure a permutation sequence to arrive there? That is for the local
>> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx]
>> when we should interchange?
>>
>> That subloop_stride_p and tracking invariant DRs looks a bit odd. For loops
>> where a DR is invariant you simply do not have an evolution in that loop.
>> You seem to simply add strides in the inner and outer loops for each DR,
>> can you explain how that works? Also I guess strides bigger than the
>> various cache-line size parameters should be treated 'equal'? That is,
>> if we don't get any DR to a step that results in L1 cache hits because
>> two DRs share a cache line the interchange is pointless?
> So given below loop:
>
> for (int i = 0; i < M; i++)
> for (int j = 0; j < M; j++)
> for (int k = 0; k < M; k++)
> a[k][0][i] = b[k][0][i]
>
> We check if memory reference is invariant wrto a loop only if it has zero strides within
> current loop nest. In this example, there is no invariant given address changes in the
> innermost loop.
But they simply wouldn't take part in the sorting? That is, invariant
refs in a loop
shouldn't prevent it becoming more inner or more outer, no?
> For strides bigger than cache-line size, it's also possible the interchange is wanted, as
> in below example:
>
> for (int i = 0; i < M; i++) //loop 1
> for (int j = 0; j < M; j++) //loop 2
> for (int k = 0; k < M; k++) //loop 3
> a[j][i][k] = b[j][i][k]
>
> Strides for loop 1/2 are very like to be big, but after interchange, we will have stream
> access of both arrays.
>
> More advanced heuristics may be possible, but so far the estimation is quite good by
> checking all interchanges I looked into.
>
>>
>> +/* Prune DATAREFS by removing any data reference not inside of LOOP. */
>> +
>> +static inline void
>> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
>> +{
>> + struct data_reference *dr;
>> +
>> + for (unsigned i = 0; datarefs.iterate (i, &dr);)
>> + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
>> + i++;
>> + else
>> + {
>> + datarefs.ordered_remove (i);
>>
>> that's expensive. It's better to keep moving DRs we want to keep
>> when walking the array. That is, add a j you increment only when
>> we keep a DR, moving *i to *j.
> Done.
>
>>
>> + if (dr->aux)
>> + {
>> + DR_ACCESS_STRIDE (dr)->release ();
>> + free (dr->aux);
>> + }
>> + free_data_ref (dr);
>> + }
>> +}
>>
>> +
>> + start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
>> +
>>
>> Hmm. If you instantiate the SCEV for the niters for each loop in the nest
>> wrt the nest you can figure if it has any evolution in sth else than the
>> loop (evolution_function_is_univariate_p). That is, this is not a problem
>> until you arrive at analyzing DR strides, right? At which point you
>> can check for the appropriate form?
> Hmm, not really. The niter relation may not appear in SCEV of reference addr.
> For example, below loop:
>
> for (int i = 0; i < M; i++) //loop 1
> for (int j = 0; j < M; j++) //loop 2
> for (int k = 0; k < i; k++) //loop 3
> a[k][0][i] = b[k][0][i]
>
> There is no information in data reference about i/j loops.
> Anyway, I refactored the code and put this check in proper_loop_form_for_interchange.
> Simpler I think.
>
>>
>> + if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
>> + /* Check if there is no data reference. */
>> + || datarefs->length () == 0
>> + /* Check if there are too many of data references. */
>> + || ((int) datarefs->length ()
>> + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
>> + /* Check if there is any data reference in loop latch. We can't handle
>> + loops which loop header and data references have different execution
>> + times. */
>> + || dataref_niters_diff_to_loop_header (*datarefs)
>>
>> this suggests to do your own find_data_references_in_loop so you can early
>> out.
> I refactored the code a bit. Now this check is in proper_loop_form_for_interchange,
> but I do customized my own data references finder. It's needed to strip outer loops
> once a difficult reference is found.
>
>>
>> Overall the flow through the pass is a bit hard to follow given there are
>> IMHO too many functions.
> Yeah, I removed quite number of small functions and refactor the code a lot. Hope this
> version is more straightforward.
>>
>> +unsigned int
>> +pass_linterchange::execute (function *fun)
>> +{
>> + if (number_of_loops (fun) <= 2)
>> + return 0;
>> +
>> + bool changed_p = false;;
>> + struct loop *loop;
>> + vec<loop_p> loop_nest;
>> + vec<data_reference_p> datarefs;
>> + vec<ddr_p> ddrs;
>> +
>> + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
>> + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
>> + {
>> + tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
>> + changed_p |= loop_interchange.interchange ();
>> + }
>>
>> you leak datarefs/ddrs?
> It was release in destructor, but I refactored it anyway. I will push the code to branch
> gcc.gnu.org/svn/gcc/branches/gimple-linterchange.
>
> Thanks again for the comment of you two.
Digging into the code now...
Richard.
> Thanks,
> bin
> 2017-11-27 Bin Cheng <bin.cheng@arm.com>
>
> * Makefile.in (gimple-loop-interchange.o): New object file.
> * common.opt (floop-interchange): Reuse the option from graphite.
> * doc/invoke.texi (-floop-interchange): Ditto. New document.
> * gimple-loop-interchange.cc: New file.
> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
> * passes.def (pass_linterchange): New pass.
> * timevar.def (TV_LINTERCHANGE): New time var.
> * tree-pass.h (make_pass_linterchange): New declaration.
> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
> interchange. Record IV before/after increment in new parameters.
> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>
> gcc/testsuite
> 2017-11-27 Bin Cheng <bin.cheng@arm.com>
>
> * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
> * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
next prev parent reply other threads:[~2017-11-30 13:01 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-28 15:55 Bin Cheng
2017-11-28 16:14 ` David Malcolm
2017-11-28 16:16 ` Bin.Cheng
2017-11-30 13:19 ` Richard Biener [this message]
2017-11-30 14:34 ` Richard Biener
2017-11-30 15:07 ` Bin.Cheng
2017-11-30 15:14 ` Richard Biener
2017-11-30 16:01 ` Richard Biener
2017-11-30 18:24 ` Bin.Cheng
2017-11-30 18:41 ` Richard Biener
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAFiYyc12RH3VLjs3vxjcSwWA5CKXWOzbU2Bf5eBh5dzEEaeTPg@mail.gmail.com \
--to=richard.guenther@gmail.com \
--cc=Bin.Cheng@arm.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=matz@suse.de \
--cc=nd@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).