On Mon, Feb 20, 2017 at 4:05 PM, Jan Hubicka wrote: >> BTW, if we use gcov_type in calculation from expected_loop_iterations_unbounded, >> how should we adjust the number for calling scale_loop_frequencies >> which has int type? In extreme case, gcov_type could be out of int's >> range, we have to cap the value anyway. But yes, 10000 in >> expect_loop_iterations is too small. > > What I usually do is to use fixed point math in this case (based on REG_BR_PROB_BASE). > Just pass REG_BR_PROB_BASE as den and calculate the adjustment in gcov_type converting > to int. Because you should be just decreasing the counts, it won't overflow and because > the decarese will be in range, say 2...256 times, it should also be sufficiently > precise. > > Be careful to avoid overflow of gcov type - it is not safe to multiply two > counts in 64bit math because each count can be more than 2^32. (next stage1 I > plan to switch most of this to sreals that will make this easier) > >> >> > But I guess here it is sort of safe because vectorized loops are simple. >> >> > You can't just scale down the existing counts/frequencies by vf, because the >> >> > entry edge frequency was adjusted. >> >> I am not 100% follow here, it looks the code avoids changing frequency >> >> counter for preheader/exit edge, otherwise we would need to change all >> >> counters dominated by them? >> > >> > I was just wondering what is wrong with taking the existing frequencies/counts >> > the loop body has and dividing them all by the unroll factor. This is correct >> > if you ignore the versioning. With versioning I guess you want to scale by >> > the unroll factor and also subtract frequencies/counts that was acocunted to >> > the other versions of the loop, right? >> IIUC, for (vectorized) loop header, it's frequency is calculated by: >> freq_header = freq_preheader + freq_latch >> and freq_latch = (niter * freq_preheader). Simply scaling it by VF gives: >> freq_header = (freq_preheader + freq_latch) / VF >> which is wrong. Especially if the loop is vectorized by large VF >> (=16) and we normally assume niter (=10) without profiling >> information, it results not only mismatch, but also >> (loop->header->frequency < loop->preheader->frequency). In fact, if >> we have accurate niter information, the counter should be: >> freq_header = freq_preheader + niter * freq_preheader > > You are right. We need to compensate for the change of probability of the loop > exit edge. >> >> >> > >> >> > Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you >> >> > need to compute it before you start chanigng the CFG by peeling proplogue? >> >> Peeling for prologue doesn't change profiling information of >> >> vect_loop, it is the skip edge from before loop to preferred epilogue >> >> loop that will change profile counters. I guess here exists a dilemma >> >> that niter_for_unrolled_loop is for loop after peeling for prologue? >> > >> > expected_loop_iterations_unbounded calculates number of iteations by computing >> > sum of frequencies of edges entering the loop and comparing it to the frequency >> > of loop header. While peeling the prologue, you split the preheader edge and >> > adjust frequency of the new preheader BB of the loop to be vectorized. I think >> > that will adjust the #of iterations estimate. >> It's not the case now I think. one motivation of new vect_do_peeling >> is to avoid niter checks between prologue and vector loop. Once >> prologue loop is entered or checked, the vector loop must be executed >> unconditionally. So the preheaderof vector loop has consistent >> frequency counters now. The niter check on whether vector loop should >> be executed is now merged with cost check before prologue, and in the >> future I think this can be further merged if loop versioning is >> needed. > > Originally you have > > loop_preheader > | > v > loop_header > > and the ratio of the two BB frequencies is the loop iteration count. Then you > do something like: > > orig_loop_preheader > | > v > loop_prologue -----> scalar_version_of_loop > | > v > new_loop_preheader > | > v > loop_header > > At some point, you need to update new_loop_preheader frequency/count > to reflect the fact that with some probability the loop_prologue avoids > the vectorized loop. Once you do it and if you don't scale frequency of > loop_header you will make expect_loop_iterations to return higher value > than previously. > > So at the time you are calling it, you need to be sure that the loop_header > and its preheader frequences was both adjusted by same factor. Or you need > to call it early before you start hacking on the CFG and its profile. > > Pehaps currently it is safe, because your peeling code is also scaling > the loop profiles. Hi Honza, Attachment is the updated patch. Bootstrap and test ongoing, any comments? Thanks, bin 2017-02-21 Bin Cheng PR tree-optimization/77536 * tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function. (tree_transform_and_unroll_loop): Use above function to compute the estimated niter of unrolled loop and use it when scaling profile. * tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration. * tree-vect-loop.c (scale_profile_for_vect_loop): New function. (vect_transform_loop): Call above function. gcc/testsuite/ChangeLog 2017-02-21 Bin Cheng PR tree-optimization/77536 * gcc.dg/vect/pr79347.c: Revise testing string.