On Mon, Feb 20, 2017 at 4:05 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> BTW, if we use gcov_type in calculation from expected_loop_iterations_unbounded,
>> how should we adjust the number for calling scale_loop_frequencies
>> which has int type?  In extreme case, gcov_type could be out of int's
>> range, we have to cap the value anyway.  But yes, 10000 in
>> expect_loop_iterations is too small.
>
> What I usually do is to use fixed point math in this case (based on REG_BR_PROB_BASE).
> Just pass REG_BR_PROB_BASE as den and calculate the adjustment in gcov_type converting
> to int. Because you should be just decreasing the counts, it won't overflow and because
> the decarese will be in range, say 2...256 times, it should also be sufficiently
> precise.
>
> Be careful to avoid overflow of gcov type - it is not safe to multiply two
> counts in 64bit math because each count can be more than 2^32.  (next stage1 I
> plan to switch most of this to sreals that will make this easier)
>
>> >> > But I guess here it is sort of safe because vectorized loops are simple.
>> >> > You can't just scale down the existing counts/frequencies by vf, because the
>> >> > entry edge frequency was adjusted.
>> >> I am not 100% follow here, it looks the code avoids changing frequency
>> >> counter for preheader/exit edge, otherwise we would need to change all
>> >> counters dominated by them?
>> >
>> > I was just wondering what is wrong with taking the existing frequencies/counts
>> > the loop body has and dividing them all by the unroll factor.  This is correct
>> > if you ignore the versioning. With versioning I guess you want to scale by
>> > the unroll factor and also subtract frequencies/counts that was acocunted to
>> > the other versions of the loop, right?
>> IIUC, for (vectorized) loop header, it's frequency is calculated by:
>>           freq_header = freq_preheader + freq_latch
>> and freq_latch = (niter * freq_preheader).  Simply scaling it by VF gives:
>>           freq_header = (freq_preheader + freq_latch) / VF
>> which is wrong.  Especially if the loop is vectorized by large VF
>> (=16) and we normally assume niter (=10) without profiling
>> information, it results not only mismatch, but also
>> (loop->header->frequency < loop->preheader->frequency).  In fact, if
>> we have accurate niter information, the counter should be:
>>           freq_header = freq_preheader + niter * freq_preheader
>
> You are right. We need to compensate for the change of probability of the loop
> exit edge.
>>
>> >> >
>> >> > Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
>> >> > need to compute it before you start chanigng the CFG by peeling proplogue?
>> >> Peeling for prologue doesn't change profiling information of
>> >> vect_loop, it is the skip edge from before loop to preferred epilogue
>> >> loop that will change profile counters.  I guess here exists a dilemma
>> >> that niter_for_unrolled_loop is for loop after peeling for prologue?
>> >
>> > expected_loop_iterations_unbounded calculates number of iteations by computing
>> > sum of frequencies of edges entering the loop and comparing it to the frequency
>> > of loop header.  While peeling the prologue, you split the preheader edge and
>> > adjust frequency of the new preheader BB of the loop to be vectorized.  I think
>> > that will adjust the #of iterations estimate.
>> It's not the case now I think.  one motivation of new vect_do_peeling
>> is to avoid niter checks between prologue and vector loop.  Once
>> prologue loop is entered or checked, the vector loop must be executed
>> unconditionally.  So the preheaderof vector loop has consistent
>> frequency counters now.  The niter check on whether vector loop should
>> be executed is now merged with cost check before prologue, and in the
>> future I think this can be further merged if loop versioning is
>> needed.
>
> Originally you have
>
>   loop_preheader
>        |
>        v
>    loop_header
>
> and the ratio of the two BB frequencies is the loop iteration count. Then you
> do something like:
>
>   orig_loop_preheader
>        |
>        v
>    loop_prologue -----> scalar_version_of_loop
>        |
>        v
>  new_loop_preheader
>        |
>        v
>    loop_header
>
> At some point, you need to update new_loop_preheader frequency/count
> to reflect the fact that with some probability the loop_prologue avoids
> the vectorized loop.  Once you do it and if you don't scale frequency of
> loop_header you will make expect_loop_iterations to return higher value
> than previously.
>
> So at the time you are calling it, you need to be sure that the loop_header
> and its preheader frequences was both adjusted by same factor.  Or you need
> to call it early before you start hacking on the CFG and its profile.
>
> Pehaps currently it is safe, because your peeling code is also scaling
> the loop profiles.
Hi Honza,
Attachment is the updated patch.  Bootstrap and test ongoing, any comments?

Thanks,
bin

2017-02-21  Bin Cheng  <bin.cheng@arm.com>

PR tree-optimization/77536
* tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function.
(tree_transform_and_unroll_loop): Use above function to compute the
estimated niter of unrolled loop and use it when scaling profile.
* tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration.
* tree-vect-loop.c (scale_profile_for_vect_loop): New function.
(vect_transform_loop): Call above function.

gcc/testsuite/ChangeLog
2017-02-21  Bin Cheng  <bin.cheng@arm.com>

PR tree-optimization/77536
* gcc.dg/vect/pr79347.c: Revise testing string.