From: Richard Biener <richard.guenther@gmail.com>
To: "Bin.Cheng" <amker.cheng@gmail.com>
Cc: Bin Cheng <Bin.Cheng@arm.com>,
"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>, "matz@suse.de" <matz@suse.de>
Subject: Re: [PATCH GCC][V2]A simple implementation of loop interchange
Date: Thu, 30 Nov 2017 15:14:00 -0000 [thread overview]
Message-ID: <CAFiYyc03QDJa6-fAbCHd_axQw0886ZEVMO1LevjVhgE4sG=ovg@mail.gmail.com> (raw)
In-Reply-To: <CAHFci2-9jhkHNRAsjcNymVMELS-cMN8j1OX9Pv8xsfBhEpA7MQ@mail.gmail.com>
On Thu, Nov 30, 2017 at 3:13 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Thu, Nov 30, 2017 at 1:01 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> This is updated patch with review comments resolved. Some explanation embedded below.
>>>
>>>> +
>>>> + iloop->nb_iterations = nb_iterations;
>>>>
>>>> use std::swap? Also I think if you can keep nb_iterations you
>>>> can certainly keep the upper bounds. You're probably
>>>> afraid of the ->stmt references in the nb_iter_bound entries?
>>>>
>>>> Anyway, either scrap everything or try to keep everything.
>>> Yeah, not only the stmts, but also the control_iv information because the SCEV
>>> information may be corrupted during code transformation.
>>> Now I discarded all the information.
>>
>> Note that given you interchange the loops but not the CFG or the loop structures
>> you might want to swap loop->num and flags like ->force_vectorize. That is,
>> essentially change the ->header/latch association (and other CFG related stuff
>> like recorded exits).
>>
>> It might also be we want to / need to disable interchange for, say,
>> ->force_vectorize
>> inner loops or ->unroll != 0? Or we need to clear them, maybe
>> optionally diagnosing
>> that fact.
>>
>> At least we need to think about what it means to preserve loop
>> structure (semantically,
>> loop->num should maintain association to the same source-level loop
>> throughout the
>> compilation) for transforms like interchange.
>>
>>>>
>>>> + for (i = 0; oloop.reductions.iterate (i, &re); ++i)
>>>> + {
>>>> + if (re->type != DOUBLE_RTYPE)
>>>> + gcc_unreachable ();
>>>> +
>>>> + use_operand_p use_p;
>>>> + imm_use_iterator iterator;
>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
>>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
>>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
>>>> + if (TREE_CODE (re->init) == SSA_NAME)
>>>> + {
>>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
>>>> + mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
>>>> + }
>>>>
>>>> can you add a comment what you are doing here?
>>>>
>>>> Note that other loop opts simply scrap all debug stmts ...
>>> As mentioned above, updated patch doesn't try hard to maintain debug use info any more.
>>>
>>>>
>>>> +static void
>>>> +compute_access_stride (struct loop *loop_nest, struct loop *loop,
>>>> + data_reference_p dr)
>>>> +{
>>>> ...
>>>> + tree ref = DR_REF (dr);
>>>> + tree scev_base = build_fold_addr_expr (ref);
>>>> + tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
>>>> + tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
>>>> + access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
>>>> +
>>>> + do {
>>>> + tree scev_fn = analyze_scalar_evolution (loop, scev_base);
>>>> + if (chrec_contains_undetermined (scev_fn)
>>>> + || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
>>>> + break;
>>>> ...
>>>> + strides->safe_push (scev_step);
>>>> + } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
>>>> +
>>>>
>>>> I _think_ you want to do
>>>>
>>>> scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
>>>> DR_STMT (dr) is in loop
>>>> scev_fn = instantiate_parameters (nest, scev_fn);
>>>> if (chrec_contains_undetermined (scev_fn))
>>>> return; // false?
>>>>
>>>> and analyze the result which should be of the form
>>>>
>>>> { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
>>>>
>>>> if canonical. I think estimate_val_by_simplify_replace isn't needed
>>>> if you do that
>>>> (it also looks odd to replace all vairables in step by niter...).
>>> I replied on this in previous message, instantiate_parameters doesn't always
>>> give canonical form result as expected. The loop here could be seen as a
>>> local instantiate process, right?
>>
>> Kind of. I'll see if I can reproduce the difference with any of your
>> intercahnge
>> testcases - any hint which one to look at?
> For added tests, I think there will be no difference between the two.
> I noticed the difference for
> pointer cases like:
> for (i...)
> for (j...)
> for (k...)
> p[i*n*n + j*n+ k] =...
>
>>
>>> Also estimate_val_by_simplify_replace is needed for pointers, where strides
>>> are computed from niters of loops which could be non compilation time constant.
>>> But yes, it's an odd fixup after I failed to do anything better.
>>
>> But you are for example computing _1 - _2 to zero, right? Because both _1
>> and _2 are not constant and thus you replace it with the same (symbolical)
>> constant 'niter'.
>>
>> I think that asks for garbage-in-garbage-out ...
>>
>> Which testcase is this important for so I can have a look?
> So far this is only for the above pointer case. Actually I don't
> think it's that important, and thought about skip it.
> So we don't have to do estimate_val_by_simplify_replace.
Ok, I'd like to "dumb" the pass down to the level we can solve the
bwave case (which I realize is already one of the more complicated ones).
Just because it's already late for GCC 8.
Richard.
> Thanks,
> bin
>>
>>>>
>>>> I think keeping the chrec in the above form is also more suitable for what
>>>> the caller does so the outermost loop is simply
>>>>
>>>> loop = loop_nest;
>>>> loop-over-all-dr-chrecs
>>>> if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
>>>> loop = CHREC_LOOP (chrec);
>>>>
>>>> given the outermost loop is the outer evolution. If you sort the
>>>> stride vecs from inner
>>>> to outer you don't need prune_access_strides_not_in_loop.
>>> Hmm, So stripping outer loops prefer inner to outer sort of strides, but cost computation
>>> during interchange prefers outer to inner sort because loop_nest in tree-data-ref is sorted
>>> in this way. Seems a single prune_* function is better than fiddling with cost computation.
>>
>> Not sure how to interpret your answer... I'll see to have a more
>> detailed suggestion
>> after playing with the code a bit.
>>
>>>>
>>>> +/* Count and return the number of loops in LOOP_NEST. */
>>>> +
>>>> +unsigned int
>>>> +num_loops_in_loop_nest (struct loop *loop_nest)
>>>> +{
>>>> + unsigned num_loops;
>>>> + for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner)
>>>> + ;
>>>> + return num_loops;
>>>>
>>>> loop_depth (innermost) - loop_depth (nest)?
>>> Done.
>>>
>>>>
>>>> +static bool
>>>> +should_interchange_loops (unsigned i_idx, unsigned o_idx,
>>>> + vec<data_reference_p> datarefs,
>>>> + bool innermost_loops_p, bool dump_info_p = true)
>>>> +{
>>>>
>>>> isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs
>>>> and figure a permutation sequence to arrive there? That is for the local
>>>> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx]
>>>> when we should interchange?
>>>>
>>>> That subloop_stride_p and tracking invariant DRs looks a bit odd. For loops
>>>> where a DR is invariant you simply do not have an evolution in that loop.
>>>> You seem to simply add strides in the inner and outer loops for each DR,
>>>> can you explain how that works? Also I guess strides bigger than the
>>>> various cache-line size parameters should be treated 'equal'? That is,
>>>> if we don't get any DR to a step that results in L1 cache hits because
>>>> two DRs share a cache line the interchange is pointless?
>>> So given below loop:
>>>
>>> for (int i = 0; i < M; i++)
>>> for (int j = 0; j < M; j++)
>>> for (int k = 0; k < M; k++)
>>> a[k][0][i] = b[k][0][i]
>>>
>>> We check if memory reference is invariant wrto a loop only if it has zero strides within
>>> current loop nest. In this example, there is no invariant given address changes in the
>>> innermost loop.
>>
>> But they simply wouldn't take part in the sorting? That is, invariant
>> refs in a loop
>> shouldn't prevent it becoming more inner or more outer, no?
>>
>>> For strides bigger than cache-line size, it's also possible the interchange is wanted, as
>>> in below example:
>>>
>>> for (int i = 0; i < M; i++) //loop 1
>>> for (int j = 0; j < M; j++) //loop 2
>>> for (int k = 0; k < M; k++) //loop 3
>>> a[j][i][k] = b[j][i][k]
>>>
>>> Strides for loop 1/2 are very like to be big, but after interchange, we will have stream
>>> access of both arrays.
>>>
>>> More advanced heuristics may be possible, but so far the estimation is quite good by
>>> checking all interchanges I looked into.
>>>
>>>>
>>>> +/* Prune DATAREFS by removing any data reference not inside of LOOP. */
>>>> +
>>>> +static inline void
>>>> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
>>>> +{
>>>> + struct data_reference *dr;
>>>> +
>>>> + for (unsigned i = 0; datarefs.iterate (i, &dr);)
>>>> + if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
>>>> + i++;
>>>> + else
>>>> + {
>>>> + datarefs.ordered_remove (i);
>>>>
>>>> that's expensive. It's better to keep moving DRs we want to keep
>>>> when walking the array. That is, add a j you increment only when
>>>> we keep a DR, moving *i to *j.
>>> Done.
>>>
>>>>
>>>> + if (dr->aux)
>>>> + {
>>>> + DR_ACCESS_STRIDE (dr)->release ();
>>>> + free (dr->aux);
>>>> + }
>>>> + free_data_ref (dr);
>>>> + }
>>>> +}
>>>>
>>>> +
>>>> + start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
>>>> +
>>>>
>>>> Hmm. If you instantiate the SCEV for the niters for each loop in the nest
>>>> wrt the nest you can figure if it has any evolution in sth else than the
>>>> loop (evolution_function_is_univariate_p). That is, this is not a problem
>>>> until you arrive at analyzing DR strides, right? At which point you
>>>> can check for the appropriate form?
>>> Hmm, not really. The niter relation may not appear in SCEV of reference addr.
>>> For example, below loop:
>>>
>>> for (int i = 0; i < M; i++) //loop 1
>>> for (int j = 0; j < M; j++) //loop 2
>>> for (int k = 0; k < i; k++) //loop 3
>>> a[k][0][i] = b[k][0][i]
>>>
>>> There is no information in data reference about i/j loops.
>>> Anyway, I refactored the code and put this check in proper_loop_form_for_interchange.
>>> Simpler I think.
>>>
>>>>
>>>> + if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
>>>> + /* Check if there is no data reference. */
>>>> + || datarefs->length () == 0
>>>> + /* Check if there are too many of data references. */
>>>> + || ((int) datarefs->length ()
>>>> + > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
>>>> + /* Check if there is any data reference in loop latch. We can't handle
>>>> + loops which loop header and data references have different execution
>>>> + times. */
>>>> + || dataref_niters_diff_to_loop_header (*datarefs)
>>>>
>>>> this suggests to do your own find_data_references_in_loop so you can early
>>>> out.
>>> I refactored the code a bit. Now this check is in proper_loop_form_for_interchange,
>>> but I do customized my own data references finder. It's needed to strip outer loops
>>> once a difficult reference is found.
>>>
>>>>
>>>> Overall the flow through the pass is a bit hard to follow given there are
>>>> IMHO too many functions.
>>> Yeah, I removed quite number of small functions and refactor the code a lot. Hope this
>>> version is more straightforward.
>>>>
>>>> +unsigned int
>>>> +pass_linterchange::execute (function *fun)
>>>> +{
>>>> + if (number_of_loops (fun) <= 2)
>>>> + return 0;
>>>> +
>>>> + bool changed_p = false;;
>>>> + struct loop *loop;
>>>> + vec<loop_p> loop_nest;
>>>> + vec<data_reference_p> datarefs;
>>>> + vec<ddr_p> ddrs;
>>>> +
>>>> + FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
>>>> + if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
>>>> + {
>>>> + tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
>>>> + changed_p |= loop_interchange.interchange ();
>>>> + }
>>>>
>>>> you leak datarefs/ddrs?
>>> It was release in destructor, but I refactored it anyway. I will push the code to branch
>>> gcc.gnu.org/svn/gcc/branches/gimple-linterchange.
>>>
>>> Thanks again for the comment of you two.
>>
>> Digging into the code now...
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>> 2017-11-27 Bin Cheng <bin.cheng@arm.com>
>>>
>>> * Makefile.in (gimple-loop-interchange.o): New object file.
>>> * common.opt (floop-interchange): Reuse the option from graphite.
>>> * doc/invoke.texi (-floop-interchange): Ditto. New document.
>>> * gimple-loop-interchange.cc: New file.
>>> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
>>> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
>>> * passes.def (pass_linterchange): New pass.
>>> * timevar.def (TV_LINTERCHANGE): New time var.
>>> * tree-pass.h (make_pass_linterchange): New declaration.
>>> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
>>> interchange. Record IV before/after increment in new parameters.
>>> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>>
>>> gcc/testsuite
>>> 2017-11-27 Bin Cheng <bin.cheng@arm.com>
>>>
>>> * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
>>> * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
next prev parent reply other threads:[~2017-11-30 15:09 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-28 15:55 Bin Cheng
2017-11-28 16:14 ` David Malcolm
2017-11-28 16:16 ` Bin.Cheng
2017-11-30 13:19 ` Richard Biener
2017-11-30 14:34 ` Richard Biener
2017-11-30 15:07 ` Bin.Cheng
2017-11-30 15:14 ` Richard Biener [this message]
2017-11-30 16:01 ` Richard Biener
2017-11-30 18:24 ` Bin.Cheng
2017-11-30 18:41 ` Richard Biener
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAFiYyc03QDJa6-fAbCHd_axQw0886ZEVMO1LevjVhgE4sG=ovg@mail.gmail.com' \
--to=richard.guenther@gmail.com \
--cc=Bin.Cheng@arm.com \
--cc=amker.cheng@gmail.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=matz@suse.de \
--cc=nd@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).