[PATCH PR77536]Generate correct profiling information for vectorized loop

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH PR77536]Generate correct profiling information for vectorized loop
@ 2017-02-16 18:38 Bin Cheng
  2017-02-17  1:39 ` Pat Haugen
  2017-02-20 12:54 ` Richard Biener
  0 siblings, 2 replies; 13+ messages in thread
From: Bin Cheng @ 2017-02-16 18:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, pthaugen

[-- Attachment #1: Type: text/plain, Size: 1447 bytes --]

Hi,
After fixing PR79347, the main problem with vectorizer is we scale down profiling counters
for vect_loop by VF, regardless of context CFG's profiling information.  This is wrong and
sometimes makes vect_loop not hot code, which may confuses following optimizers.  This
patch generates well-formed profiling information as generic tree unroller does.  It borrows
code from tree-ssa-loop-manip.c and does small refactors to existing code.  Vectorizer will
not introduce mismatch counters with it, and unroll and vrp2 are the two major passes
messing up profiling counters now.  Though it removes all mismatch in vectorizer, I am not
sure if this counts as a regression fix.
BTW, it may also help PR78116?  Hi Pat, could you please help verify this?  Thanks!

Bootstrap and test on x86_64 and AArch64 ongoing.  Is it OK if no failures?

Thanks,
bin

2017-02-16  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/77536
	* tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function.
	(tree_transform_and_unroll_loop): Use above function to compute the
	estimated niter of unrolled loop.
	* tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration.
	* tree-vect-loop.c (scale_profile_for_vect_loop): New function.
	(vect_transform_loop): Call above function.

gcc/testsuite/ChangeLog
2017-02-16  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/77536
	* gcc.dg/vect/pr79347.c: Revise testing string.

[-- Attachment #2: pr77536-20170216.txt --]
[-- Type: text/plain, Size: 6580 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
index 586c638..6825420 100644
--- a/gcc/testsuite/gcc.dg/vect/pr79347.c
+++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
@@ -10,4 +10,4 @@ void n(void)
     a[i]++;
 }
 
-/* { dg-final { scan-tree-dump-times "Invalid sum of " 2 "vect" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum of " "vect" } } */
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 43df29c..6d868c5 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -1093,6 +1093,31 @@ scale_dominated_blocks_in_loop (struct loop *loop, basic_block bb,
     }
 }
 
+/* Return estimated niter for LOOP after unrolling by FACTOR times.  */
+
+unsigned
+niter_for_unrolled_loop (struct loop *loop, unsigned factor)
+{
+  unsigned est_niter = expected_loop_iterations (loop);
+  gcc_assert (factor != 0);
+  unsigned new_est_niter = est_niter / factor;
+
+  /* Without profile feedback, loops for which we do not know a better estimate
+     are assumed to roll 10 times.  When we unroll such loop, it appears to
+     roll too little, and it may even seem to be cold.  To avoid this, we
+     ensure that the created loop appears to roll at least 5 times (but at
+     most as many times as before unrolling).  */
+  if (new_est_niter < 5)
+    {
+      if (est_niter < 5)
+	new_est_niter = est_niter;
+      else
+	new_est_niter = 5;
+    }
+
+  return new_est_niter;
+}
+
 /* Unroll LOOP FACTOR times.  DESC describes number of iterations of LOOP.
    EXIT is the exit of the loop to that DESC corresponds.
 
@@ -1170,12 +1195,11 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   gimple_stmt_iterator bsi;
   use_operand_p op;
   bool ok;
-  unsigned est_niter, prob_entry, scale_unrolled, scale_rest, freq_e, freq_h;
-  unsigned new_est_niter, i, prob;
+  unsigned i, prob, prob_entry, scale_unrolled, scale_rest, freq_e, freq_h;
+  unsigned new_est_niter = niter_for_unrolled_loop (loop, factor);
   unsigned irr = loop_preheader_edge (loop)->flags & EDGE_IRREDUCIBLE_LOOP;
   auto_vec<edge> to_remove;
 
-  est_niter = expected_loop_iterations (loop);
   determine_exit_conditions (loop, desc, factor,
 			     &enter_main_cond, &exit_base, &exit_step,
 			     &exit_cmp, &exit_bound);
@@ -1207,22 +1231,6 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   gcc_assert (new_loop != NULL);
   update_ssa (TODO_update_ssa);
 
-  /* Determine the probability of the exit edge of the unrolled loop.  */
-  new_est_niter = est_niter / factor;
-
-  /* Without profile feedback, loops for that we do not know a better estimate
-     are assumed to roll 10 times.  When we unroll such loop, it appears to
-     roll too little, and it may even seem to be cold.  To avoid this, we
-     ensure that the created loop appears to roll at least 5 times (but at
-     most as many times as before unrolling).  */
-  if (new_est_niter < 5)
-    {
-      if (est_niter < 5)
-	new_est_niter = est_niter;
-      else
-	new_est_niter = 5;
-    }
-
   /* Prepare the cfg and update the phi nodes.  Move the loop exit to the
      loop latch (and make its condition dummy, for the moment).  */
   rest = loop_preheader_edge (new_loop)->src;
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index 1e7531f..961b38a 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -48,6 +48,7 @@ extern bool gimple_duplicate_loop_to_header_edge (struct loop *, edge,
 						  int);
 extern bool can_unroll_loop_p (struct loop *loop, unsigned factor,
 			       struct tree_niter_desc *niter);
+extern unsigned niter_for_unrolled_loop (struct loop *, unsigned);
 extern void tree_transform_and_unroll_loop (struct loop *, unsigned,
 					    edge, struct tree_niter_desc *,
 					    transform_callback, void *);
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index c5a1627..c311c04 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6718,6 +6718,35 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
   return false;
 }
 
+/* Scale profiling counters by estimation for LOOP which is vectorized
+   by factor VF.  */
+
+static void
+scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
+{
+  unsigned freq_h = loop->header->frequency;
+  unsigned freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
+  /* Reduce loop iterations by the vectorization factor.  */
+  unsigned new_est_niter = niter_for_unrolled_loop (loop, vf);
+
+  if (freq_h != 0)
+    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
+
+  basic_block exit_bb = single_pred (loop->latch);
+  edge exit_e = single_exit (loop);
+  exit_e->count = loop_preheader_edge (loop)->count;
+  exit_e->probability = REG_BR_PROB_BASE / (new_est_niter + 1);
+
+  edge exit_l = single_pred_edge (loop->latch);
+  int prob = exit_l->probability;
+  exit_l->probability = REG_BR_PROB_BASE - exit_e->probability;
+  exit_l->count = exit_bb->count - exit_e->count;
+  if (exit_l->count < 0)
+    exit_l->count = 0;
+  if (prob > 0)
+    scale_bbs_frequencies_int (&loop->latch, 1, exit_l->probability, prob);
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -6743,16 +6772,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   bool transform_pattern_stmt = false;
   bool check_profitability = false;
   int th;
-  /* Record number of iterations before we started tampering with the profile. */
-  gcov_type expected_iterations = expected_loop_iterations_unbounded (loop);
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location, "=== vec_transform_loop ===\n");
 
-  /* If profile is inprecise, we have chance to fix it up.  */
-  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
-    expected_iterations = LOOP_VINFO_INT_NITERS (loop_vinfo);
-
   /* Use the more conservative vectorization threshold.  If the number
      of iterations is constant assume the cost check has been performed
      by our caller.  If the threshold makes all loops profitable that
@@ -7068,9 +7091,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   slpeel_make_loop_iterate_ntimes (loop, niters_vector);
 
-  /* Reduce loop iterations by the vectorization factor.  */
-  scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
-		      expected_iterations / vf);
+  scale_profile_for_vect_loop (loop, vf);
+
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-16 18:38 [PATCH PR77536]Generate correct profiling information for vectorized loop Bin Cheng
@ 2017-02-17  1:39 ` Pat Haugen
  2017-02-20 12:54 ` Richard Biener
  1 sibling, 0 replies; 13+ messages in thread
From: Pat Haugen @ 2017-02-17  1:39 UTC (permalink / raw)
  To: Bin Cheng, gcc-patches; +Cc: nd

On 02/16/2017 11:48 AM, Bin Cheng wrote:
> BTW, it may also help PR78116?  Hi Pat, could you please help verify this?  Thanks!

The first testcase in pr78116 no longer contains loads from spill in the
loop even before your patch. When built with your patch, there are four
additional register copies in the loop (vmovaps %zmm2, %zmm14).

As for the second testcase, your patch gets rid of the 12 loads from
spill in the loop.

-Pat

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-16 18:38 [PATCH PR77536]Generate correct profiling information for vectorized loop Bin Cheng
  2017-02-17  1:39 ` Pat Haugen
@ 2017-02-20 12:54 ` Richard Biener
  2017-02-20 14:21   ` Jan Hubicka
  1 sibling, 1 reply; 13+ messages in thread
From: Richard Biener @ 2017-02-20 12:54 UTC (permalink / raw)
  To: Bin Cheng, Jan Hubicka; +Cc: gcc-patches, nd, pthaugen

On Thu, Feb 16, 2017 at 6:48 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> After fixing PR79347, the main problem with vectorizer is we scale down profiling counters
> for vect_loop by VF, regardless of context CFG's profiling information.  This is wrong and
> sometimes makes vect_loop not hot code, which may confuses following optimizers.  This
> patch generates well-formed profiling information as generic tree unroller does.  It borrows
> code from tree-ssa-loop-manip.c and does small refactors to existing code.  Vectorizer will
> not introduce mismatch counters with it, and unroll and vrp2 are the two major passes
> messing up profiling counters now.  Though it removes all mismatch in vectorizer, I am not
> sure if this counts as a regression fix.
> BTW, it may also help PR78116?  Hi Pat, could you please help verify this?  Thanks!
>
> Bootstrap and test on x86_64 and AArch64 ongoing.  Is it OK if no failures?

The patch looks good to me but please wait for Honza to comment.

Thanks,
Richard.

> Thanks,
> bin
>
> 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/77536
>         * tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function.
>         (tree_transform_and_unroll_loop): Use above function to compute the
>         estimated niter of unrolled loop.
>         * tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration.
>         * tree-vect-loop.c (scale_profile_for_vect_loop): New function.
>         (vect_transform_loop): Call above function.
>
> gcc/testsuite/ChangeLog
> 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
>
>         PR tree-optimization/77536
>         * gcc.dg/vect/pr79347.c: Revise testing string.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-20 12:54 ` Richard Biener
@ 2017-02-20 14:21   ` Jan Hubicka
  2017-02-20 15:16     ` Bin.Cheng
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Hubicka @ 2017-02-20 14:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bin Cheng, Jan Hubicka, gcc-patches, nd, pthaugen

> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
> >
> >         PR tree-optimization/77536
> >         * tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function.
> >         (tree_transform_and_unroll_loop): Use above function to compute the
> >         estimated niter of unrolled loop.
> >         * tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration.
> >         * tree-vect-loop.c (scale_profile_for_vect_loop): New function.
> >         (vect_transform_loop): Call above function.

+/* Return estimated niter for LOOP after unrolling by FACTOR times.  */
+
+unsigned
+niter_for_unrolled_loop (struct loop *loop, unsigned factor)
+{
+  unsigned est_niter = expected_loop_iterations (loop);

What happens when you have profile and loop iterates very many times?
Perhaps we want to do all calculation in gcov_type and use
expected_loop_iterations_unbounded>?

expected_loop_iterations is capping by 10000 that is easy to overflow.

+  gcc_assert (factor != 0);
+  unsigned new_est_niter = est_niter / factor;
+
+  /* Without profile feedback, loops for which we do not know a better estimate
+     are assumed to roll 10 times.  When we unroll such loop, it appears to
+     roll too little, and it may even seem to be cold.  To avoid this, we
+     ensure that the created loop appears to roll at least 5 times (but at
+     most as many times as before unrolling).  */
+  if (new_est_niter < 5)
+    {
+      if (est_niter < 5)
+	new_est_niter = est_niter;
+      else
+	new_est_niter = 5;
+    }
+
+  return new_est_niter;
+}

I see this code is pre-existing, but please extend it to test if
loop->header->count is non-zero.  Even if we do not have idea about loop
iteration count estimate we may end up predicting more than 10 iterations when
predictors combine that way.

Perhaps testing estimated-loop_iterations would also make sense, but that
could be dealt with incrementally.

+static void
+scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
+{
+  unsigned freq_h = loop->header->frequency;
+  unsigned freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
+  /* Reduce loop iterations by the vectorization factor.  */
+  unsigned new_est_niter = niter_for_unrolled_loop (loop, vf);
+
+  if (freq_h != 0)
+    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
+
I am always trying to avoid propagating small mistakes (i.e. frong freq_h or
freq_h) into bigger mistakes (i.e. wrong profile of the whole loop) to avoid
spreading mistakes across cfg.

But I guess here it is sort of safe because vectorized loops are simple.
You can't just scale down the existing counts/frequencies by vf, because the
entry edge frequency was adjusted.

Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
need to compute it before you start chanigng the CFG by peeling proplogue?

Finally if freq_e is 0, all frequencies and counts will be probably dropped to
0.  What about determining fraction by counts if they are available?

Otherwise the patch looks good and thanks a lot for working on this!

Honza

> >
> > gcc/testsuite/ChangeLog
> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
> >
> >         PR tree-optimization/77536
> >         * gcc.dg/vect/pr79347.c: Revise testing string.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-20 14:21   ` Jan Hubicka
@ 2017-02-20 15:16     ` Bin.Cheng
  2017-02-20 15:44       ` Jan Hubicka
  0 siblings, 1 reply; 13+ messages in thread
From: Bin.Cheng @ 2017-02-20 15:16 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Richard Biener, Bin Cheng, gcc-patches, nd, pthaugen

On Mon, Feb 20, 2017 at 2:02 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
>> >
>> >         PR tree-optimization/77536
>> >         * tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function.
>> >         (tree_transform_and_unroll_loop): Use above function to compute the
>> >         estimated niter of unrolled loop.
>> >         * tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration.
>> >         * tree-vect-loop.c (scale_profile_for_vect_loop): New function.
>> >         (vect_transform_loop): Call above function.
Thanks very much for your suggestions.  I don't know profiling logic
very well and have some questions embedded below before I start
revising patch.
>
> +/* Return estimated niter for LOOP after unrolling by FACTOR times.  */
> +
> +unsigned
> +niter_for_unrolled_loop (struct loop *loop, unsigned factor)
> +{
> +  unsigned est_niter = expected_loop_iterations (loop);
>
> What happens when you have profile and loop iterates very many times?
> Perhaps we want to do all calculation in gcov_type and use
> expected_loop_iterations_unbounded>?
>
> expected_loop_iterations is capping by 10000 that is easy to overflow.
>
> +  gcc_assert (factor != 0);
> +  unsigned new_est_niter = est_niter / factor;
> +
> +  /* Without profile feedback, loops for which we do not know a better estimate
> +     are assumed to roll 10 times.  When we unroll such loop, it appears to
> +     roll too little, and it may even seem to be cold.  To avoid this, we
> +     ensure that the created loop appears to roll at least 5 times (but at
> +     most as many times as before unrolling).  */
> +  if (new_est_niter < 5)
> +    {
> +      if (est_niter < 5)
> +       new_est_niter = est_niter;
> +      else
> +       new_est_niter = 5;
> +    }
> +
> +  return new_est_niter;
> +}
>
> I see this code is pre-existing, but please extend it to test if
> loop->header->count is non-zero.  Even if we do not have idea about loop
> iteration count estimate we may end up predicting more than 10 iterations when
> predictors combine that way.
If I use expected_loop_iterations_unbounded, then do I need to handle
loop->header->count explicitly here?  I suppose not because it has
below code already:

  /* If we have no profile at all, use AVG_LOOP_NITER.  */
  if (profile_status_for_fn (cfun) == PROFILE_ABSENT)
    expected = PARAM_VALUE (PARAM_AVG_LOOP_NITER);
  else if (loop->latch && (loop->latch->count || loop->header->count))
    {
      gcov_type count_in, count_latch;
      //...

The second question is: looks like it only takes latch->count into
consideration when PROFILE_ABSENT.  But according to your comments, we
could have nonzero count sometime?

>
> Perhaps testing estimated-loop_iterations would also make sense, but that
> could be dealt with incrementally.
>
> +static void
> +scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
> +{
> +  unsigned freq_h = loop->header->frequency;
> +  unsigned freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
> +  /* Reduce loop iterations by the vectorization factor.  */
> +  unsigned new_est_niter = niter_for_unrolled_loop (loop, vf);
> +
> +  if (freq_h != 0)
> +    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
> +
> I am always trying to avoid propagating small mistakes (i.e. frong freq_h or
> freq_h) into bigger mistakes (i.e. wrong profile of the whole loop) to avoid
> spreading mistakes across cfg.
>
> But I guess here it is sort of safe because vectorized loops are simple.
> You can't just scale down the existing counts/frequencies by vf, because the
> entry edge frequency was adjusted.
I am not 100% follow here, it looks the code avoids changing frequency
counter for preheader/exit edge, otherwise we would need to change all
counters dominated by them?
>
> Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
> need to compute it before you start chanigng the CFG by peeling proplogue?
Peeling for prologue doesn't change profiling information of
vect_loop, it is the skip edge from before loop to preferred epilogue
loop that will change profile counters.  I guess here exists a dilemma
that niter_for_unrolled_loop is for loop after peeling for prologue?

Thanks,
bin
>
> Finally if freq_e is 0, all frequencies and counts will be probably dropped to
> 0.  What about determining fraction by counts if they are available?
>
> Otherwise the patch looks good and thanks a lot for working on this!
>
> Honza
>
>> >
>> > gcc/testsuite/ChangeLog
>> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
>> >
>> >         PR tree-optimization/77536
>> >         * gcc.dg/vect/pr79347.c: Revise testing string.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-20 15:16     ` Bin.Cheng
@ 2017-02-20 15:44       ` Jan Hubicka
  2017-02-20 16:05         ` Bin.Cheng
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Hubicka @ 2017-02-20 15:44 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: Jan Hubicka, Richard Biener, Bin Cheng, gcc-patches, nd, pthaugen

> > +  /* Without profile feedback, loops for which we do not know a better estimate
> > +     are assumed to roll 10 times.  When we unroll such loop, it appears to
> > +     roll too little, and it may even seem to be cold.  To avoid this, we
> > +     ensure that the created loop appears to roll at least 5 times (but at
> > +     most as many times as before unrolling).  */
> > +  if (new_est_niter < 5)
> > +    {
> > +      if (est_niter < 5)
> > +       new_est_niter = est_niter;
> > +      else
> > +       new_est_niter = 5;
> > +    }
> > +
> > +  return new_est_niter;
> > +}
> >
> > I see this code is pre-existing, but please extend it to test if
> > loop->header->count is non-zero.  Even if we do not have idea about loop
> > iteration count estimate we may end up predicting more than 10 iterations when
> > predictors combine that way.
> If I use expected_loop_iterations_unbounded, then do I need to handle
> loop->header->count explicitly here?  I suppose not because it has
> below code already:
> 
>   /* If we have no profile at all, use AVG_LOOP_NITER.  */
>   if (profile_status_for_fn (cfun) == PROFILE_ABSENT)
>     expected = PARAM_VALUE (PARAM_AVG_LOOP_NITER);
>   else if (loop->latch && (loop->latch->count || loop->header->count))
>     {
>       gcov_type count_in, count_latch;
>       //...

expected_loop_iterations_unbounded avoids capping the # of iterations comming
from profile feedback (so if loop iterates 1000000 it retns 1000000 instead of
10000). Testing loop->header->count for non-zero tests if the loop was ever
entered in the train run so it is easy test if you have any profile feedback
for a given loop available.

I profile feedback is there, then making usre that new_est_niter is at least
5 will only make feedback less acurate.  I am not sure if we vectorize loop
that will iterate only 1...4 times after unrolling, but I guess we could,
and it would be desirable to peel it afterwards.
> 
> The second question is: looks like it only takes latch->count into
> consideration when PROFILE_ABSENT.  But according to your comments, we
> could have nonzero count sometime?

profile_status_for_fn (cfun) == PROFILE_ABSENT is set only when there is no
profile esitmate at all - i.e. you compile with
-fno-guess-branch-probabilities.  We don't really use that path very often
but in that case you have no data to guess your profile from.

> 
> >
> > Perhaps testing estimated-loop_iterations would also make sense, but that
> > could be dealt with incrementally.
> >
> > +static void
> > +scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
> > +{
> > +  unsigned freq_h = loop->header->frequency;
> > +  unsigned freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
> > +  /* Reduce loop iterations by the vectorization factor.  */
> > +  unsigned new_est_niter = niter_for_unrolled_loop (loop, vf);
> > +
> > +  if (freq_h != 0)
> > +    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
> > +
> > I am always trying to avoid propagating small mistakes (i.e. frong freq_h or
> > freq_h) into bigger mistakes (i.e. wrong profile of the whole loop) to avoid
> > spreading mistakes across cfg.
> >
> > But I guess here it is sort of safe because vectorized loops are simple.
> > You can't just scale down the existing counts/frequencies by vf, because the
> > entry edge frequency was adjusted.
> I am not 100% follow here, it looks the code avoids changing frequency
> counter for preheader/exit edge, otherwise we would need to change all
> counters dominated by them?

I was just wondering what is wrong with taking the existing frequencies/counts
the loop body has and dividing them all by the unroll factor.  This is correct
if you ignore the versioning. With versioning I guess you want to scale by
the unroll factor and also subtract frequencies/counts that was acocunted to
the other versions of the loop, right?
> >
> > Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
> > need to compute it before you start chanigng the CFG by peeling proplogue?
> Peeling for prologue doesn't change profiling information of
> vect_loop, it is the skip edge from before loop to preferred epilogue
> loop that will change profile counters.  I guess here exists a dilemma
> that niter_for_unrolled_loop is for loop after peeling for prologue?

expected_loop_iterations_unbounded calculates number of iteations by computing
sum of frequencies of edges entering the loop and comparing it to the frequency
of loop header.  While peeling the prologue, you split the preheader edge and
adjust frequency of the new preheader BB of the loop to be vectorized.  I think
that will adjust the #of iterations estimate.

Honza
> 
> Thanks,
> bin
> >
> > Finally if freq_e is 0, all frequencies and counts will be probably dropped to
> > 0.  What about determining fraction by counts if they are available?
> >
> > Otherwise the patch looks good and thanks a lot for working on this!
> >
> > Honza
> >
> >> >
> >> > gcc/testsuite/ChangeLog
> >> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
> >> >
> >> >         PR tree-optimization/77536
> >> >         * gcc.dg/vect/pr79347.c: Revise testing string.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-20 15:44       ` Jan Hubicka
@ 2017-02-20 16:05         ` Bin.Cheng
  2017-02-20 17:02           ` Jan Hubicka
  0 siblings, 1 reply; 13+ messages in thread
From: Bin.Cheng @ 2017-02-20 16:05 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Richard Biener, Bin Cheng, gcc-patches, nd, pthaugen

On Mon, Feb 20, 2017 at 3:17 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > +  /* Without profile feedback, loops for which we do not know a better estimate
>> > +     are assumed to roll 10 times.  When we unroll such loop, it appears to
>> > +     roll too little, and it may even seem to be cold.  To avoid this, we
>> > +     ensure that the created loop appears to roll at least 5 times (but at
>> > +     most as many times as before unrolling).  */
>> > +  if (new_est_niter < 5)
>> > +    {
>> > +      if (est_niter < 5)
>> > +       new_est_niter = est_niter;
>> > +      else
>> > +       new_est_niter = 5;
>> > +    }
>> > +
>> > +  return new_est_niter;
>> > +}
>> >
>> > I see this code is pre-existing, but please extend it to test if
>> > loop->header->count is non-zero.  Even if we do not have idea about loop
>> > iteration count estimate we may end up predicting more than 10 iterations when
>> > predictors combine that way.
>> If I use expected_loop_iterations_unbounded, then do I need to handle
>> loop->header->count explicitly here?  I suppose not because it has
>> below code already:
>>
>>   /* If we have no profile at all, use AVG_LOOP_NITER.  */
>>   if (profile_status_for_fn (cfun) == PROFILE_ABSENT)
>>     expected = PARAM_VALUE (PARAM_AVG_LOOP_NITER);
>>   else if (loop->latch && (loop->latch->count || loop->header->count))
>>     {
>>       gcov_type count_in, count_latch;
>>       //...
>
> expected_loop_iterations_unbounded avoids capping the # of iterations comming
> from profile feedback (so if loop iterates 1000000 it retns 1000000 instead of
> 10000). Testing loop->header->count for non-zero tests if the loop was ever
> entered in the train run so it is easy test if you have any profile feedback
> for a given loop available.
BTW, if we use gcov_type in calculation from expected_loop_iterations_unbounded,
how should we adjust the number for calling scale_loop_frequencies
which has int type?  In extreme case, gcov_type could be out of int's
range, we have to cap the value anyway.  But yes, 10000 in
expect_loop_iterations is too small.
>
> I profile feedback is there, then making usre that new_est_niter is at least
> 5 will only make feedback less acurate.  I am not sure if we vectorize loop
> that will iterate only 1...4 times after unrolling, but I guess we could,
> and it would be desirable to peel it afterwards.
Right, I think we can discard this lower bound if profiling
information is present.
>>
>> The second question is: looks like it only takes latch->count into
>> consideration when PROFILE_ABSENT.  But according to your comments, we
>> could have nonzero count sometime?
>
> profile_status_for_fn (cfun) == PROFILE_ABSENT is set only when there is no
> profile esitmate at all - i.e. you compile with
> -fno-guess-branch-probabilities.  We don't really use that path very often
> but in that case you have no data to guess your profile from.
>
>>
>> >
>> > Perhaps testing estimated-loop_iterations would also make sense, but that
>> > could be dealt with incrementally.
>> >
>> > +static void
>> > +scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
>> > +{
>> > +  unsigned freq_h = loop->header->frequency;
>> > +  unsigned freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
>> > +  /* Reduce loop iterations by the vectorization factor.  */
>> > +  unsigned new_est_niter = niter_for_unrolled_loop (loop, vf);
>> > +
>> > +  if (freq_h != 0)
>> > +    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
>> > +
>> > I am always trying to avoid propagating small mistakes (i.e. frong freq_h or
>> > freq_h) into bigger mistakes (i.e. wrong profile of the whole loop) to avoid
>> > spreading mistakes across cfg.
>> >
>> > But I guess here it is sort of safe because vectorized loops are simple.
>> > You can't just scale down the existing counts/frequencies by vf, because the
>> > entry edge frequency was adjusted.
>> I am not 100% follow here, it looks the code avoids changing frequency
>> counter for preheader/exit edge, otherwise we would need to change all
>> counters dominated by them?
>
> I was just wondering what is wrong with taking the existing frequencies/counts
> the loop body has and dividing them all by the unroll factor.  This is correct
> if you ignore the versioning. With versioning I guess you want to scale by
> the unroll factor and also subtract frequencies/counts that was acocunted to
> the other versions of the loop, right?
IIUC, for (vectorized) loop header, it's frequency is calculated by:
          freq_header = freq_preheader + freq_latch
and freq_latch = (niter * freq_preheader).  Simply scaling it by VF gives:
          freq_header = (freq_preheader + freq_latch) / VF
which is wrong.  Especially if the loop is vectorized by large VF
(=16) and we normally assume niter (=10) without profiling
information, it results not only mismatch, but also
(loop->header->frequency < loop->preheader->frequency).  In fact, if
we have accurate niter information, the counter should be:
          freq_header = freq_preheader + niter * freq_preheader

>> >
>> > Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
>> > need to compute it before you start chanigng the CFG by peeling proplogue?
>> Peeling for prologue doesn't change profiling information of
>> vect_loop, it is the skip edge from before loop to preferred epilogue
>> loop that will change profile counters.  I guess here exists a dilemma
>> that niter_for_unrolled_loop is for loop after peeling for prologue?
>
> expected_loop_iterations_unbounded calculates number of iteations by computing
> sum of frequencies of edges entering the loop and comparing it to the frequency
> of loop header.  While peeling the prologue, you split the preheader edge and
> adjust frequency of the new preheader BB of the loop to be vectorized.  I think
> that will adjust the #of iterations estimate.
It's not the case now I think.  one motivation of new vect_do_peeling
is to avoid niter checks between prologue and vector loop.  Once
prologue loop is entered or checked, the vector loop must be executed
unconditionally.  So the preheaderof vector loop has consistent
frequency counters now.  The niter check on whether vector loop should
be executed is now merged with cost check before prologue, and in the
future I think this can be further merged if loop versioning is
needed.

Thanks,
bin
>
> Honza
>>
>> Thanks,
>> bin
>> >
>> > Finally if freq_e is 0, all frequencies and counts will be probably dropped to
>> > 0.  What about determining fraction by counts if they are available?
>> >
>> > Otherwise the patch looks good and thanks a lot for working on this!
>> >
>> > Honza
>> >
>> >> >
>> >> > gcc/testsuite/ChangeLog
>> >> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
>> >> >
>> >> >         PR tree-optimization/77536
>> >> >         * gcc.dg/vect/pr79347.c: Revise testing string.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-20 16:05         ` Bin.Cheng
@ 2017-02-20 17:02           ` Jan Hubicka
  2017-02-20 17:53             ` Bin.Cheng
  2017-02-21 14:48             ` Bin.Cheng
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Hubicka @ 2017-02-20 17:02 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: Jan Hubicka, Richard Biener, Bin Cheng, gcc-patches, nd, pthaugen

> BTW, if we use gcov_type in calculation from expected_loop_iterations_unbounded,
> how should we adjust the number for calling scale_loop_frequencies
> which has int type?  In extreme case, gcov_type could be out of int's
> range, we have to cap the value anyway.  But yes, 10000 in
> expect_loop_iterations is too small.

What I usually do is to use fixed point math in this case (based on REG_BR_PROB_BASE).
Just pass REG_BR_PROB_BASE as den and calculate the adjustment in gcov_type converting
to int. Because you should be just decreasing the counts, it won't overflow and because
the decarese will be in range, say 2...256 times, it should also be sufficiently
precise.

Be careful to avoid overflow of gcov type - it is not safe to multiply two
counts in 64bit math because each count can be more than 2^32.  (next stage1 I
plan to switch most of this to sreals that will make this easier)

> >> > But I guess here it is sort of safe because vectorized loops are simple.
> >> > You can't just scale down the existing counts/frequencies by vf, because the
> >> > entry edge frequency was adjusted.
> >> I am not 100% follow here, it looks the code avoids changing frequency
> >> counter for preheader/exit edge, otherwise we would need to change all
> >> counters dominated by them?
> >
> > I was just wondering what is wrong with taking the existing frequencies/counts
> > the loop body has and dividing them all by the unroll factor.  This is correct
> > if you ignore the versioning. With versioning I guess you want to scale by
> > the unroll factor and also subtract frequencies/counts that was acocunted to
> > the other versions of the loop, right?
> IIUC, for (vectorized) loop header, it's frequency is calculated by:
>           freq_header = freq_preheader + freq_latch
> and freq_latch = (niter * freq_preheader).  Simply scaling it by VF gives:
>           freq_header = (freq_preheader + freq_latch) / VF
> which is wrong.  Especially if the loop is vectorized by large VF
> (=16) and we normally assume niter (=10) without profiling
> information, it results not only mismatch, but also
> (loop->header->frequency < loop->preheader->frequency).  In fact, if
> we have accurate niter information, the counter should be:
>           freq_header = freq_preheader + niter * freq_preheader

You are right. We need to compensate for the change of probability of the loop
exit edge.
> 
> >> >
> >> > Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
> >> > need to compute it before you start chanigng the CFG by peeling proplogue?
> >> Peeling for prologue doesn't change profiling information of
> >> vect_loop, it is the skip edge from before loop to preferred epilogue
> >> loop that will change profile counters.  I guess here exists a dilemma
> >> that niter_for_unrolled_loop is for loop after peeling for prologue?
> >
> > expected_loop_iterations_unbounded calculates number of iteations by computing
> > sum of frequencies of edges entering the loop and comparing it to the frequency
> > of loop header.  While peeling the prologue, you split the preheader edge and
> > adjust frequency of the new preheader BB of the loop to be vectorized.  I think
> > that will adjust the #of iterations estimate.
> It's not the case now I think.  one motivation of new vect_do_peeling
> is to avoid niter checks between prologue and vector loop.  Once
> prologue loop is entered or checked, the vector loop must be executed
> unconditionally.  So the preheaderof vector loop has consistent
> frequency counters now.  The niter check on whether vector loop should
> be executed is now merged with cost check before prologue, and in the
> future I think this can be further merged if loop versioning is
> needed.

Originally you have

  loop_preheader
       |
       v
   loop_header

and the ratio of the two BB frequencies is the loop iteration count. Then you
do something like:

  orig_loop_preheader
       |
       v
   loop_prologue -----> scalar_version_of_loop
       |
       v
 new_loop_preheader
       |
       v
   loop_header

At some point, you need to update new_loop_preheader frequency/count
to reflect the fact that with some probability the loop_prologue avoids
the vectorized loop.  Once you do it and if you don't scale frequency of
loop_header you will make expect_loop_iterations to return higher value
than previously.

So at the time you are calling it, you need to be sure that the loop_header
and its preheader frequences was both adjusted by same factor.  Or you need
to call it early before you start hacking on the CFG and its profile.

Pehaps currently it is safe, because your peeling code is also scaling
the loop profiles.

Honza
> 
> Thanks,
> bin
> >
> > Honza
> >>
> >> Thanks,
> >> bin
> >> >
> >> > Finally if freq_e is 0, all frequencies and counts will be probably dropped to
> >> > 0.  What about determining fraction by counts if they are available?
> >> >
> >> > Otherwise the patch looks good and thanks a lot for working on this!
> >> >
> >> > Honza
> >> >
> >> >> >
> >> >> > gcc/testsuite/ChangeLog
> >> >> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
> >> >> >
> >> >> >         PR tree-optimization/77536
> >> >> >         * gcc.dg/vect/pr79347.c: Revise testing string.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-20 17:02           ` Jan Hubicka
@ 2017-02-20 17:53             ` Bin.Cheng
  2017-02-21 14:48             ` Bin.Cheng
  1 sibling, 0 replies; 13+ messages in thread
From: Bin.Cheng @ 2017-02-20 17:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Richard Biener, Bin Cheng, gcc-patches, nd, pthaugen

On Mon, Feb 20, 2017 at 4:05 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> BTW, if we use gcov_type in calculation from expected_loop_iterations_unbounded,
>> how should we adjust the number for calling scale_loop_frequencies
>> which has int type?  In extreme case, gcov_type could be out of int's
>> range, we have to cap the value anyway.  But yes, 10000 in
>> expect_loop_iterations is too small.
>
> What I usually do is to use fixed point math in this case (based on REG_BR_PROB_BASE).
> Just pass REG_BR_PROB_BASE as den and calculate the adjustment in gcov_type converting
> to int. Because you should be just decreasing the counts, it won't overflow and because
> the decarese will be in range, say 2...256 times, it should also be sufficiently
> precise.
>
> Be careful to avoid overflow of gcov type - it is not safe to multiply two
> counts in 64bit math because each count can be more than 2^32.  (next stage1 I
> plan to switch most of this to sreals that will make this easier)
>
>> >> > But I guess here it is sort of safe because vectorized loops are simple.
>> >> > You can't just scale down the existing counts/frequencies by vf, because the
>> >> > entry edge frequency was adjusted.
>> >> I am not 100% follow here, it looks the code avoids changing frequency
>> >> counter for preheader/exit edge, otherwise we would need to change all
>> >> counters dominated by them?
>> >
>> > I was just wondering what is wrong with taking the existing frequencies/counts
>> > the loop body has and dividing them all by the unroll factor.  This is correct
>> > if you ignore the versioning. With versioning I guess you want to scale by
>> > the unroll factor and also subtract frequencies/counts that was acocunted to
>> > the other versions of the loop, right?
>> IIUC, for (vectorized) loop header, it's frequency is calculated by:
>>           freq_header = freq_preheader + freq_latch
>> and freq_latch = (niter * freq_preheader).  Simply scaling it by VF gives:
>>           freq_header = (freq_preheader + freq_latch) / VF
>> which is wrong.  Especially if the loop is vectorized by large VF
>> (=16) and we normally assume niter (=10) without profiling
>> information, it results not only mismatch, but also
>> (loop->header->frequency < loop->preheader->frequency).  In fact, if
>> we have accurate niter information, the counter should be:
>>           freq_header = freq_preheader + niter * freq_preheader
>
> You are right. We need to compensate for the change of probability of the loop
> exit edge.
>>
>> >> >
>> >> > Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
>> >> > need to compute it before you start chanigng the CFG by peeling proplogue?
>> >> Peeling for prologue doesn't change profiling information of
>> >> vect_loop, it is the skip edge from before loop to preferred epilogue
>> >> loop that will change profile counters.  I guess here exists a dilemma
>> >> that niter_for_unrolled_loop is for loop after peeling for prologue?
>> >
>> > expected_loop_iterations_unbounded calculates number of iteations by computing
>> > sum of frequencies of edges entering the loop and comparing it to the frequency
>> > of loop header.  While peeling the prologue, you split the preheader edge and
>> > adjust frequency of the new preheader BB of the loop to be vectorized.  I think
>> > that will adjust the #of iterations estimate.
>> It's not the case now I think.  one motivation of new vect_do_peeling
>> is to avoid niter checks between prologue and vector loop.  Once
>> prologue loop is entered or checked, the vector loop must be executed
>> unconditionally.  So the preheaderof vector loop has consistent
>> frequency counters now.  The niter check on whether vector loop should
>> be executed is now merged with cost check before prologue, and in the
>> future I think this can be further merged if loop versioning is
>> needed.
>
> Originally you have
>
>   loop_preheader
>        |
>        v
>    loop_header
>
> and the ratio of the two BB frequencies is the loop iteration count. Then you
> do something like:
>
>   orig_loop_preheader
>        |
>        v
>    loop_prologue -----> scalar_version_of_loop
>        |
>        v
>  new_loop_preheader
>        |
>        v
>    loop_header
It's like:

   orig_loop_preheader
        |
        v
    check_on_niter -----> scalar_version_of_loop (i.e, epilog_loop)
        |
        v
    loop_prologue
        |
        v
  new_loop_preheader
        |
        v
    loop_header

Yes, the preheader/header need to be consistent here, and it is now.
Thanks for explaining.

> At some point, you need to update new_loop_preheader frequency/count
> to reflect the fact that with some probability the loop_prologue avoids
> the vectorized loop.  Once you do it and if you don't scale frequency of
> loop_header you will make expect_loop_iterations to return higher value
> than previously.
>
> So at the time you are calling it, you need to be sure that the loop_header
> and its preheader frequences was both adjusted by same factor.  Or you need
> to call it early before you start hacking on the CFG and its profile.
>
> Pehaps currently it is safe, because your peeling code is also scaling
> the loop profiles.

I will revise the patch according to this discussion.

Thanks,
bin
>
> Honza
>>
>> Thanks,
>> bin
>> >
>> > Honza
>> >>
>> >> Thanks,
>> >> bin
>> >> >
>> >> > Finally if freq_e is 0, all frequencies and counts will be probably dropped to
>> >> > 0.  What about determining fraction by counts if they are available?
>> >> >
>> >> > Otherwise the patch looks good and thanks a lot for working on this!
>> >> >
>> >> > Honza
>> >> >
>> >> >> >
>> >> >> > gcc/testsuite/ChangeLog
>> >> >> > 2017-02-16  Bin Cheng  <bin.cheng@arm.com>
>> >> >> >
>> >> >> >         PR tree-optimization/77536
>> >> >> >         * gcc.dg/vect/pr79347.c: Revise testing string.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-20 17:02           ` Jan Hubicka
  2017-02-20 17:53             ` Bin.Cheng
@ 2017-02-21 14:48             ` Bin.Cheng
  2017-02-21 15:52               ` Jan Hubicka
  1 sibling, 1 reply; 13+ messages in thread
From: Bin.Cheng @ 2017-02-21 14:48 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Richard Biener, gcc-patches, pthaugen

[-- Attachment #1: Type: text/plain, Size: 5590 bytes --]

On Mon, Feb 20, 2017 at 4:05 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> BTW, if we use gcov_type in calculation from expected_loop_iterations_unbounded,
>> how should we adjust the number for calling scale_loop_frequencies
>> which has int type?  In extreme case, gcov_type could be out of int's
>> range, we have to cap the value anyway.  But yes, 10000 in
>> expect_loop_iterations is too small.
>
> What I usually do is to use fixed point math in this case (based on REG_BR_PROB_BASE).
> Just pass REG_BR_PROB_BASE as den and calculate the adjustment in gcov_type converting
> to int. Because you should be just decreasing the counts, it won't overflow and because
> the decarese will be in range, say 2...256 times, it should also be sufficiently
> precise.
>
> Be careful to avoid overflow of gcov type - it is not safe to multiply two
> counts in 64bit math because each count can be more than 2^32.  (next stage1 I
> plan to switch most of this to sreals that will make this easier)
>
>> >> > But I guess here it is sort of safe because vectorized loops are simple.
>> >> > You can't just scale down the existing counts/frequencies by vf, because the
>> >> > entry edge frequency was adjusted.
>> >> I am not 100% follow here, it looks the code avoids changing frequency
>> >> counter for preheader/exit edge, otherwise we would need to change all
>> >> counters dominated by them?
>> >
>> > I was just wondering what is wrong with taking the existing frequencies/counts
>> > the loop body has and dividing them all by the unroll factor.  This is correct
>> > if you ignore the versioning. With versioning I guess you want to scale by
>> > the unroll factor and also subtract frequencies/counts that was acocunted to
>> > the other versions of the loop, right?
>> IIUC, for (vectorized) loop header, it's frequency is calculated by:
>>           freq_header = freq_preheader + freq_latch
>> and freq_latch = (niter * freq_preheader).  Simply scaling it by VF gives:
>>           freq_header = (freq_preheader + freq_latch) / VF
>> which is wrong.  Especially if the loop is vectorized by large VF
>> (=16) and we normally assume niter (=10) without profiling
>> information, it results not only mismatch, but also
>> (loop->header->frequency < loop->preheader->frequency).  In fact, if
>> we have accurate niter information, the counter should be:
>>           freq_header = freq_preheader + niter * freq_preheader
>
> You are right. We need to compensate for the change of probability of the loop
> exit edge.
>>
>> >> >
>> >> > Also niter_for_unrolled_loop depends on sanity of the profile, so perhaps you
>> >> > need to compute it before you start chanigng the CFG by peeling proplogue?
>> >> Peeling for prologue doesn't change profiling information of
>> >> vect_loop, it is the skip edge from before loop to preferred epilogue
>> >> loop that will change profile counters.  I guess here exists a dilemma
>> >> that niter_for_unrolled_loop is for loop after peeling for prologue?
>> >
>> > expected_loop_iterations_unbounded calculates number of iteations by computing
>> > sum of frequencies of edges entering the loop and comparing it to the frequency
>> > of loop header.  While peeling the prologue, you split the preheader edge and
>> > adjust frequency of the new preheader BB of the loop to be vectorized.  I think
>> > that will adjust the #of iterations estimate.
>> It's not the case now I think.  one motivation of new vect_do_peeling
>> is to avoid niter checks between prologue and vector loop.  Once
>> prologue loop is entered or checked, the vector loop must be executed
>> unconditionally.  So the preheaderof vector loop has consistent
>> frequency counters now.  The niter check on whether vector loop should
>> be executed is now merged with cost check before prologue, and in the
>> future I think this can be further merged if loop versioning is
>> needed.
>
> Originally you have
>
>   loop_preheader
>        |
>        v
>    loop_header
>
> and the ratio of the two BB frequencies is the loop iteration count. Then you
> do something like:
>
>   orig_loop_preheader
>        |
>        v
>    loop_prologue -----> scalar_version_of_loop
>        |
>        v
>  new_loop_preheader
>        |
>        v
>    loop_header
>
> At some point, you need to update new_loop_preheader frequency/count
> to reflect the fact that with some probability the loop_prologue avoids
> the vectorized loop.  Once you do it and if you don't scale frequency of
> loop_header you will make expect_loop_iterations to return higher value
> than previously.
>
> So at the time you are calling it, you need to be sure that the loop_header
> and its preheader frequences was both adjusted by same factor.  Or you need
> to call it early before you start hacking on the CFG and its profile.
>
> Pehaps currently it is safe, because your peeling code is also scaling
> the loop profiles.
Hi Honza,
Attachment is the updated patch.  Bootstrap and test ongoing, any comments?

Thanks,
bin

2017-02-21  Bin Cheng  <bin.cheng@arm.com>

PR tree-optimization/77536
* tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function.
(tree_transform_and_unroll_loop): Use above function to compute the
estimated niter of unrolled loop and use it when scaling profile.
* tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration.
* tree-vect-loop.c (scale_profile_for_vect_loop): New function.
(vect_transform_loop): Call above function.

gcc/testsuite/ChangeLog
2017-02-21  Bin Cheng  <bin.cheng@arm.com>

PR tree-optimization/77536
* gcc.dg/vect/pr79347.c: Revise testing string.

[-- Attachment #2: pr77536-20170217.txt --]
[-- Type: text/plain, Size: 7646 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
index 586c638..6825420 100644
--- a/gcc/testsuite/gcc.dg/vect/pr79347.c
+++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
@@ -10,4 +10,4 @@ void n(void)
     a[i]++;
 }
 
-/* { dg-final { scan-tree-dump-times "Invalid sum of " 2 "vect" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum of " "vect" } } */
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 43df29c..cf2cd48 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -1093,6 +1093,33 @@ scale_dominated_blocks_in_loop (struct loop *loop, basic_block bb,
     }
 }
 
+/* Return estimated niter for LOOP after unrolling by FACTOR times.  */
+
+gcov_type
+niter_for_unrolled_loop (struct loop *loop, unsigned factor)
+{
+  gcc_assert (factor != 0);
+  bool profile_p = false;
+  gcov_type est_niter = expected_loop_iterations_unbounded (loop, &profile_p);
+  gcov_type new_est_niter = est_niter / factor;
+
+  /* Without profile feedback, loops for which we do not know a better estimate
+     are assumed to roll 10 times.  When we unroll such loop, it appears to
+     roll too little, and it may even seem to be cold.  To avoid this, we
+     ensure that the created loop appears to roll at least 5 times (but at
+     most as many times as before unrolling).  Don't do adjustment if profile
+     feedback is present.  */
+  if (new_est_niter < 5 && !profile_p)
+    {
+      if (est_niter < 5)
+	new_est_niter = est_niter;
+      else
+	new_est_niter = 5;
+    }
+
+  return new_est_niter;
+}
+
 /* Unroll LOOP FACTOR times.  DESC describes number of iterations of LOOP.
    EXIT is the exit of the loop to that DESC corresponds.
 
@@ -1170,12 +1197,11 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   gimple_stmt_iterator bsi;
   use_operand_p op;
   bool ok;
-  unsigned est_niter, prob_entry, scale_unrolled, scale_rest, freq_e, freq_h;
-  unsigned new_est_niter, i, prob;
+  unsigned i, prob, prob_entry, scale_unrolled, scale_rest, freq_e, freq_h;
+  gcov_type new_est_niter = niter_for_unrolled_loop (loop, factor);
   unsigned irr = loop_preheader_edge (loop)->flags & EDGE_IRREDUCIBLE_LOOP;
   auto_vec<edge> to_remove;
 
-  est_niter = expected_loop_iterations (loop);
   determine_exit_conditions (loop, desc, factor,
 			     &enter_main_cond, &exit_base, &exit_step,
 			     &exit_cmp, &exit_bound);
@@ -1207,22 +1233,6 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   gcc_assert (new_loop != NULL);
   update_ssa (TODO_update_ssa);
 
-  /* Determine the probability of the exit edge of the unrolled loop.  */
-  new_est_niter = est_niter / factor;
-
-  /* Without profile feedback, loops for that we do not know a better estimate
-     are assumed to roll 10 times.  When we unroll such loop, it appears to
-     roll too little, and it may even seem to be cold.  To avoid this, we
-     ensure that the created loop appears to roll at least 5 times (but at
-     most as many times as before unrolling).  */
-  if (new_est_niter < 5)
-    {
-      if (est_niter < 5)
-	new_est_niter = est_niter;
-      else
-	new_est_niter = 5;
-    }
-
   /* Prepare the cfg and update the phi nodes.  Move the loop exit to the
      loop latch (and make its condition dummy, for the moment).  */
   rest = loop_preheader_edge (new_loop)->src;
@@ -1329,7 +1339,12 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   freq_h = loop->header->frequency;
   freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
   if (freq_h != 0)
-    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
+    {
+      gcov_type scale;
+      /* This should not overflow.  */
+      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
+      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
+    }
 
   exit_bb = single_pred (loop->latch);
   new_exit = find_edge (exit_bb, rest);
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index 1e7531f..a139050 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -48,6 +48,7 @@ extern bool gimple_duplicate_loop_to_header_edge (struct loop *, edge,
 						  int);
 extern bool can_unroll_loop_p (struct loop *loop, unsigned factor,
 			       struct tree_niter_desc *niter);
+extern gcov_type niter_for_unrolled_loop (struct loop *, unsigned);
 extern void tree_transform_and_unroll_loop (struct loop *, unsigned,
 					    edge, struct tree_niter_desc *,
 					    transform_callback, void *);
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index c5a1627..2ca03c5 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6718,6 +6718,48 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
   return false;
 }
 
+/* Scale profiling counters by estimation for LOOP which is vectorized
+   by factor VF.  */
+
+static void
+scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
+{
+  edge preheader = loop_preheader_edge (loop);
+  unsigned freq_h = loop->header->frequency;
+  unsigned freq_e = EDGE_FREQUENCY (preheader);
+  /* Reduce loop iterations by the vectorization factor.  */
+  gcov_type new_est_niter = niter_for_unrolled_loop (loop, vf);
+
+  /* Use profiling count information if frequencies are zero.  */
+  if (freq_h == 0 || freq_e == 0)
+    {
+      freq_e = preheader->count;
+      freq_h = loop->header->count;
+    }
+
+  if (freq_h != 0)
+    {
+      gcov_type scale;
+      /* This should not overflow.  */
+      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
+      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
+    }
+
+  basic_block exit_bb = single_pred (loop->latch);
+  edge exit_e = single_exit (loop);
+  exit_e->count = loop_preheader_edge (loop)->count;
+  exit_e->probability = REG_BR_PROB_BASE / (new_est_niter + 1);
+
+  edge exit_l = single_pred_edge (loop->latch);
+  int prob = exit_l->probability;
+  exit_l->probability = REG_BR_PROB_BASE - exit_e->probability;
+  exit_l->count = exit_bb->count - exit_e->count;
+  if (exit_l->count < 0)
+    exit_l->count = 0;
+  if (prob > 0)
+    scale_bbs_frequencies_int (&loop->latch, 1, exit_l->probability, prob);
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -6743,16 +6785,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   bool transform_pattern_stmt = false;
   bool check_profitability = false;
   int th;
-  /* Record number of iterations before we started tampering with the profile. */
-  gcov_type expected_iterations = expected_loop_iterations_unbounded (loop);
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location, "=== vec_transform_loop ===\n");
 
-  /* If profile is inprecise, we have chance to fix it up.  */
-  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
-    expected_iterations = LOOP_VINFO_INT_NITERS (loop_vinfo);
-
   /* Use the more conservative vectorization threshold.  If the number
      of iterations is constant assume the cost check has been performed
      by our caller.  If the threshold makes all loops profitable that
@@ -7068,9 +7104,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   slpeel_make_loop_iterate_ntimes (loop, niters_vector);
 
-  /* Reduce loop iterations by the vectorization factor.  */
-  scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
-		      expected_iterations / vf);
+  scale_profile_for_vect_loop (loop, vf);
+
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-21 14:48             ` Bin.Cheng
@ 2017-02-21 15:52               ` Jan Hubicka
  2017-02-22 12:23                 ` Bin.Cheng
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Hubicka @ 2017-02-21 15:52 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Jan Hubicka, Richard Biener, gcc-patches, pthaugen

> 2017-02-21  Bin Cheng  <bin.cheng@arm.com>
> 
> PR tree-optimization/77536
> * tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function.
> (tree_transform_and_unroll_loop): Use above function to compute the
> estimated niter of unrolled loop and use it when scaling profile.
> * tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration.
> * tree-vect-loop.c (scale_profile_for_vect_loop): New function.
> (vect_transform_loop): Call above function.
> 
> gcc/testsuite/ChangeLog
> 2017-02-21  Bin Cheng  <bin.cheng@arm.com>
> 
> PR tree-optimization/77536
> * gcc.dg/vect/pr79347.c: Revise testing string.
> @@ -1329,7 +1339,12 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
>    freq_h = loop->header->frequency;
>    freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
>    if (freq_h != 0)
> -    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
> +    {
> +      gcov_type scale;
> +      /* This should not overflow.  */
> +      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
> +      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);

You need to use counts counts when new_est_niter is derrived from profile feedback.
This is because frequencies are capped to 10000, so if loop iterates very many times,
new_est_niter will be large, freq_h will be 10000 and freq_e will be 0.

Also watch the case when freq_e==loop_preheader_edge (loop)->count==0 and freq_h
is non-zero.  Just do MAX (freq_e, 1). This will not drop the loop body profile to 0.

> +/* Scale profiling counters by estimation for LOOP which is vectorized
> +   by factor VF.  */
> +
> +static void
> +scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
> +{
> +  edge preheader = loop_preheader_edge (loop);
> +  unsigned freq_h = loop->header->frequency;
> +  unsigned freq_e = EDGE_FREQUENCY (preheader);
> +  /* Reduce loop iterations by the vectorization factor.  */
> +  gcov_type new_est_niter = niter_for_unrolled_loop (loop, vf);
> +
> +  /* Use profiling count information if frequencies are zero.  */
> +  if (freq_h == 0 || freq_e == 0)
> +    {
> +      freq_e = preheader->count;
> +      freq_h = loop->header->count;
> +    }
> +
> +  if (freq_h != 0)
> +    {
> +      gcov_type scale;
> +      /* This should not overflow.  */
> +      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
> +      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
> +    }

Similarly here. Use counts when they are non-zero and use MAX (freq_e, 1).
freq_e/freq_h needs to be gcov_type in that case.

Patch is OK with these changes.  Thanks a lot!
Honza
> +
> +  basic_block exit_bb = single_pred (loop->latch);
> +  edge exit_e = single_exit (loop);
> +  exit_e->count = loop_preheader_edge (loop)->count;
> +  exit_e->probability = REG_BR_PROB_BASE / (new_est_niter + 1);
> +
> +  edge exit_l = single_pred_edge (loop->latch);
> +  int prob = exit_l->probability;
> +  exit_l->probability = REG_BR_PROB_BASE - exit_e->probability;
> +  exit_l->count = exit_bb->count - exit_e->count;
> +  if (exit_l->count < 0)
> +    exit_l->count = 0;
> +  if (prob > 0)
> +    scale_bbs_frequencies_int (&loop->latch, 1, exit_l->probability, prob);
> +}
> +
>  /* Function vect_transform_loop.
>  
>     The analysis phase has determined that the loop is vectorizable.
> @@ -6743,16 +6785,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>    bool transform_pattern_stmt = false;
>    bool check_profitability = false;
>    int th;
> -  /* Record number of iterations before we started tampering with the profile. */
> -  gcov_type expected_iterations = expected_loop_iterations_unbounded (loop);
>  
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location, "=== vec_transform_loop ===\n");
>  
> -  /* If profile is inprecise, we have chance to fix it up.  */
> -  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> -    expected_iterations = LOOP_VINFO_INT_NITERS (loop_vinfo);
> -
>    /* Use the more conservative vectorization threshold.  If the number
>       of iterations is constant assume the cost check has been performed
>       by our caller.  If the threshold makes all loops profitable that
> @@ -7068,9 +7104,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>  
>    slpeel_make_loop_iterate_ntimes (loop, niters_vector);
>  
> -  /* Reduce loop iterations by the vectorization factor.  */
> -  scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
> -		      expected_iterations / vf);
> +  scale_profile_for_vect_loop (loop, vf);
> +
>    /* The minimum number of iterations performed by the epilogue.  This
>       is 1 when peeling for gaps because we always need a final scalar
>       iteration.  */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-21 15:52               ` Jan Hubicka
@ 2017-02-22 12:23                 ` Bin.Cheng
  2017-02-22 14:59                   ` Jan Hubicka
  0 siblings, 1 reply; 13+ messages in thread
From: Bin.Cheng @ 2017-02-22 12:23 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Richard Biener, gcc-patches, pthaugen

[-- Attachment #1: Type: text/plain, Size: 2863 bytes --]

On Tue, Feb 21, 2017 at 3:49 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> 2017-02-21  Bin Cheng  <bin.cheng@arm.com>
>>
>> PR tree-optimization/77536
>> * tree-ssa-loop-manip.c (niter_for_unrolled_loop): New function.
>> (tree_transform_and_unroll_loop): Use above function to compute the
>> estimated niter of unrolled loop and use it when scaling profile.
>> * tree-ssa-loop-manip.h niter_for_unrolled_loop(): New declaration.
>> * tree-vect-loop.c (scale_profile_for_vect_loop): New function.
>> (vect_transform_loop): Call above function.
>>
>> gcc/testsuite/ChangeLog
>> 2017-02-21  Bin Cheng  <bin.cheng@arm.com>
>>
>> PR tree-optimization/77536
>> * gcc.dg/vect/pr79347.c: Revise testing string.
>> @@ -1329,7 +1339,12 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
>>    freq_h = loop->header->frequency;
>>    freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
>>    if (freq_h != 0)
>> -    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
>> +    {
>> +      gcov_type scale;
>> +      /* This should not overflow.  */
>> +      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
>> +      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
>
> You need to use counts counts when new_est_niter is derrived from profile feedback.
> This is because frequencies are capped to 10000, so if loop iterates very many times,
> new_est_niter will be large, freq_h will be 10000 and freq_e will be 0.
>
> Also watch the case when freq_e==loop_preheader_edge (loop)->count==0 and freq_h
> is non-zero.  Just do MAX (freq_e, 1). This will not drop the loop body profile to 0.
>
>> +/* Scale profiling counters by estimation for LOOP which is vectorized
>> +   by factor VF.  */
>> +
>> +static void
>> +scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
>> +{
>> +  edge preheader = loop_preheader_edge (loop);
>> +  unsigned freq_h = loop->header->frequency;
>> +  unsigned freq_e = EDGE_FREQUENCY (preheader);
>> +  /* Reduce loop iterations by the vectorization factor.  */
>> +  gcov_type new_est_niter = niter_for_unrolled_loop (loop, vf);
>> +
>> +  /* Use profiling count information if frequencies are zero.  */
>> +  if (freq_h == 0 || freq_e == 0)
>> +    {
>> +      freq_e = preheader->count;
>> +      freq_h = loop->header->count;
>> +    }
>> +
>> +  if (freq_h != 0)
>> +    {
>> +      gcov_type scale;
>> +      /* This should not overflow.  */
>> +      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
>> +      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
>> +    }
>
> Similarly here. Use counts when they are non-zero and use MAX (freq_e, 1).
> freq_e/freq_h needs to be gcov_type in that case.
>
> Patch is OK with these changes.  Thanks a lot!
Hi Honza,
There is the 3rd version patch fixing mentioned issues.  Is it OK?

Thanks,
bin

[-- Attachment #2: pr77536-20170221.txt --]
[-- Type: text/plain, Size: 8350 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
index 586c638..6825420 100644
--- a/gcc/testsuite/gcc.dg/vect/pr79347.c
+++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
@@ -10,4 +10,4 @@ void n(void)
     a[i]++;
 }
 
-/* { dg-final { scan-tree-dump-times "Invalid sum of " 2 "vect" } } */
+/* { dg-final { scan-tree-dump-not "Invalid sum of " "vect" } } */
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 43df29c..22c832a 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -1093,6 +1093,33 @@ scale_dominated_blocks_in_loop (struct loop *loop, basic_block bb,
     }
 }
 
+/* Return estimated niter for LOOP after unrolling by FACTOR times.  */
+
+gcov_type
+niter_for_unrolled_loop (struct loop *loop, unsigned factor)
+{
+  gcc_assert (factor != 0);
+  bool profile_p = false;
+  gcov_type est_niter = expected_loop_iterations_unbounded (loop, &profile_p);
+  gcov_type new_est_niter = est_niter / factor;
+
+  /* Without profile feedback, loops for which we do not know a better estimate
+     are assumed to roll 10 times.  When we unroll such loop, it appears to
+     roll too little, and it may even seem to be cold.  To avoid this, we
+     ensure that the created loop appears to roll at least 5 times (but at
+     most as many times as before unrolling).  Don't do adjustment if profile
+     feedback is present.  */
+  if (new_est_niter < 5 && !profile_p)
+    {
+      if (est_niter < 5)
+	new_est_niter = est_niter;
+      else
+	new_est_niter = 5;
+    }
+
+  return new_est_niter;
+}
+
 /* Unroll LOOP FACTOR times.  DESC describes number of iterations of LOOP.
    EXIT is the exit of the loop to that DESC corresponds.
 
@@ -1170,12 +1197,12 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   gimple_stmt_iterator bsi;
   use_operand_p op;
   bool ok;
-  unsigned est_niter, prob_entry, scale_unrolled, scale_rest, freq_e, freq_h;
-  unsigned new_est_niter, i, prob;
+  unsigned i, prob, prob_entry, scale_unrolled, scale_rest;
+  gcov_type freq_e, freq_h;
+  gcov_type new_est_niter = niter_for_unrolled_loop (loop, factor);
   unsigned irr = loop_preheader_edge (loop)->flags & EDGE_IRREDUCIBLE_LOOP;
   auto_vec<edge> to_remove;
 
-  est_niter = expected_loop_iterations (loop);
   determine_exit_conditions (loop, desc, factor,
 			     &enter_main_cond, &exit_base, &exit_step,
 			     &exit_cmp, &exit_bound);
@@ -1207,22 +1234,6 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   gcc_assert (new_loop != NULL);
   update_ssa (TODO_update_ssa);
 
-  /* Determine the probability of the exit edge of the unrolled loop.  */
-  new_est_niter = est_niter / factor;
-
-  /* Without profile feedback, loops for that we do not know a better estimate
-     are assumed to roll 10 times.  When we unroll such loop, it appears to
-     roll too little, and it may even seem to be cold.  To avoid this, we
-     ensure that the created loop appears to roll at least 5 times (but at
-     most as many times as before unrolling).  */
-  if (new_est_niter < 5)
-    {
-      if (est_niter < 5)
-	new_est_niter = est_niter;
-      else
-	new_est_niter = 5;
-    }
-
   /* Prepare the cfg and update the phi nodes.  Move the loop exit to the
      loop latch (and make its condition dummy, for the moment).  */
   rest = loop_preheader_edge (new_loop)->src;
@@ -1326,10 +1337,25 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
   /* Ensure that the frequencies in the loop match the new estimated
      number of iterations, and change the probability of the new
      exit edge.  */
-  freq_h = loop->header->frequency;
-  freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
+
+  freq_h = loop->header->count;
+  freq_e = (loop_preheader_edge (loop))->count;
+  /* Use frequency only if counts are zero.  */
+  if (freq_h == 0 && freq_e == 0)
+    {
+      freq_h = loop->header->frequency;
+      freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
+    }
   if (freq_h != 0)
-    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
+    {
+      gcov_type scale;
+      /* Avoid dropping loop body profile counter to 0 because of zero count
+	 in loop's preheader.  */
+      freq_e = MAX (freq_e, 1);
+      /* This should not overflow.  */
+      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
+      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
+    }
 
   exit_bb = single_pred (loop->latch);
   new_exit = find_edge (exit_bb, rest);
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index 1e7531f..a139050 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -48,6 +48,7 @@ extern bool gimple_duplicate_loop_to_header_edge (struct loop *, edge,
 						  int);
 extern bool can_unroll_loop_p (struct loop *loop, unsigned factor,
 			       struct tree_niter_desc *niter);
+extern gcov_type niter_for_unrolled_loop (struct loop *, unsigned);
 extern void tree_transform_and_unroll_loop (struct loop *, unsigned,
 					    edge, struct tree_niter_desc *,
 					    transform_callback, void *);
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index c5a1627..6bbf816 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -6718,6 +6718,50 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
   return false;
 }
 
+/* Scale profiling counters by estimation for LOOP which is vectorized
+   by factor VF.  */
+
+static void
+scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
+{
+  edge preheader = loop_preheader_edge (loop);
+  /* Reduce loop iterations by the vectorization factor.  */
+  gcov_type new_est_niter = niter_for_unrolled_loop (loop, vf);
+  gcov_type freq_h = loop->header->count, freq_e = preheader->count;
+
+  /* Use frequency only if counts are zero.  */
+  if (freq_h == 0 && freq_e == 0)
+    {
+      freq_h = loop->header->frequency;
+      freq_e = EDGE_FREQUENCY (preheader);
+    }
+  if (freq_h != 0)
+    {
+      gcov_type scale;
+
+      /* Avoid dropping loop body profile counter to 0 because of zero count
+	 in loop's preheader.  */
+      freq_e = MAX (freq_e, 1);
+      /* This should not overflow.  */
+      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
+      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
+    }
+
+  basic_block exit_bb = single_pred (loop->latch);
+  edge exit_e = single_exit (loop);
+  exit_e->count = loop_preheader_edge (loop)->count;
+  exit_e->probability = REG_BR_PROB_BASE / (new_est_niter + 1);
+
+  edge exit_l = single_pred_edge (loop->latch);
+  int prob = exit_l->probability;
+  exit_l->probability = REG_BR_PROB_BASE - exit_e->probability;
+  exit_l->count = exit_bb->count - exit_e->count;
+  if (exit_l->count < 0)
+    exit_l->count = 0;
+  if (prob > 0)
+    scale_bbs_frequencies_int (&loop->latch, 1, exit_l->probability, prob);
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -6743,16 +6787,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
   bool transform_pattern_stmt = false;
   bool check_profitability = false;
   int th;
-  /* Record number of iterations before we started tampering with the profile. */
-  gcov_type expected_iterations = expected_loop_iterations_unbounded (loop);
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location, "=== vec_transform_loop ===\n");
 
-  /* If profile is inprecise, we have chance to fix it up.  */
-  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
-    expected_iterations = LOOP_VINFO_INT_NITERS (loop_vinfo);
-
   /* Use the more conservative vectorization threshold.  If the number
      of iterations is constant assume the cost check has been performed
      by our caller.  If the threshold makes all loops profitable that
@@ -7068,9 +7106,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
 
   slpeel_make_loop_iterate_ntimes (loop, niters_vector);
 
-  /* Reduce loop iterations by the vectorization factor.  */
-  scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
-		      expected_iterations / vf);
+  scale_profile_for_vect_loop (loop, vf);
+
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH PR77536]Generate correct profiling information for vectorized loop
  2017-02-22 12:23                 ` Bin.Cheng
@ 2017-02-22 14:59                   ` Jan Hubicka
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Hubicka @ 2017-02-22 14:59 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Jan Hubicka, Richard Biener, gcc-patches, pthaugen

> Hi Honza,
OK,
thanks!

Honz
> There is the 3rd version patch fixing mentioned issues.  Is it OK?
> 
> Thanks,
> bin

> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
> index 586c638..6825420 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr79347.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
> @@ -10,4 +10,4 @@ void n(void)
>      a[i]++;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "Invalid sum of " 2 "vect" } } */
> +/* { dg-final { scan-tree-dump-not "Invalid sum of " "vect" } } */
> diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
> index 43df29c..22c832a 100644
> --- a/gcc/tree-ssa-loop-manip.c
> +++ b/gcc/tree-ssa-loop-manip.c
> @@ -1093,6 +1093,33 @@ scale_dominated_blocks_in_loop (struct loop *loop, basic_block bb,
>      }
>  }
>  
> +/* Return estimated niter for LOOP after unrolling by FACTOR times.  */
> +
> +gcov_type
> +niter_for_unrolled_loop (struct loop *loop, unsigned factor)
> +{
> +  gcc_assert (factor != 0);
> +  bool profile_p = false;
> +  gcov_type est_niter = expected_loop_iterations_unbounded (loop, &profile_p);
> +  gcov_type new_est_niter = est_niter / factor;
> +
> +  /* Without profile feedback, loops for which we do not know a better estimate
> +     are assumed to roll 10 times.  When we unroll such loop, it appears to
> +     roll too little, and it may even seem to be cold.  To avoid this, we
> +     ensure that the created loop appears to roll at least 5 times (but at
> +     most as many times as before unrolling).  Don't do adjustment if profile
> +     feedback is present.  */
> +  if (new_est_niter < 5 && !profile_p)
> +    {
> +      if (est_niter < 5)
> +	new_est_niter = est_niter;
> +      else
> +	new_est_niter = 5;
> +    }
> +
> +  return new_est_niter;
> +}
> +
>  /* Unroll LOOP FACTOR times.  DESC describes number of iterations of LOOP.
>     EXIT is the exit of the loop to that DESC corresponds.
>  
> @@ -1170,12 +1197,12 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
>    gimple_stmt_iterator bsi;
>    use_operand_p op;
>    bool ok;
> -  unsigned est_niter, prob_entry, scale_unrolled, scale_rest, freq_e, freq_h;
> -  unsigned new_est_niter, i, prob;
> +  unsigned i, prob, prob_entry, scale_unrolled, scale_rest;
> +  gcov_type freq_e, freq_h;
> +  gcov_type new_est_niter = niter_for_unrolled_loop (loop, factor);
>    unsigned irr = loop_preheader_edge (loop)->flags & EDGE_IRREDUCIBLE_LOOP;
>    auto_vec<edge> to_remove;
>  
> -  est_niter = expected_loop_iterations (loop);
>    determine_exit_conditions (loop, desc, factor,
>  			     &enter_main_cond, &exit_base, &exit_step,
>  			     &exit_cmp, &exit_bound);
> @@ -1207,22 +1234,6 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
>    gcc_assert (new_loop != NULL);
>    update_ssa (TODO_update_ssa);
>  
> -  /* Determine the probability of the exit edge of the unrolled loop.  */
> -  new_est_niter = est_niter / factor;
> -
> -  /* Without profile feedback, loops for that we do not know a better estimate
> -     are assumed to roll 10 times.  When we unroll such loop, it appears to
> -     roll too little, and it may even seem to be cold.  To avoid this, we
> -     ensure that the created loop appears to roll at least 5 times (but at
> -     most as many times as before unrolling).  */
> -  if (new_est_niter < 5)
> -    {
> -      if (est_niter < 5)
> -	new_est_niter = est_niter;
> -      else
> -	new_est_niter = 5;
> -    }
> -
>    /* Prepare the cfg and update the phi nodes.  Move the loop exit to the
>       loop latch (and make its condition dummy, for the moment).  */
>    rest = loop_preheader_edge (new_loop)->src;
> @@ -1326,10 +1337,25 @@ tree_transform_and_unroll_loop (struct loop *loop, unsigned factor,
>    /* Ensure that the frequencies in the loop match the new estimated
>       number of iterations, and change the probability of the new
>       exit edge.  */
> -  freq_h = loop->header->frequency;
> -  freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
> +
> +  freq_h = loop->header->count;
> +  freq_e = (loop_preheader_edge (loop))->count;
> +  /* Use frequency only if counts are zero.  */
> +  if (freq_h == 0 && freq_e == 0)
> +    {
> +      freq_h = loop->header->frequency;
> +      freq_e = EDGE_FREQUENCY (loop_preheader_edge (loop));
> +    }
>    if (freq_h != 0)
> -    scale_loop_frequencies (loop, freq_e * (new_est_niter + 1), freq_h);
> +    {
> +      gcov_type scale;
> +      /* Avoid dropping loop body profile counter to 0 because of zero count
> +	 in loop's preheader.  */
> +      freq_e = MAX (freq_e, 1);
> +      /* This should not overflow.  */
> +      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
> +      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
> +    }
>  
>    exit_bb = single_pred (loop->latch);
>    new_exit = find_edge (exit_bb, rest);
> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> index 1e7531f..a139050 100644
> --- a/gcc/tree-ssa-loop-manip.h
> +++ b/gcc/tree-ssa-loop-manip.h
> @@ -48,6 +48,7 @@ extern bool gimple_duplicate_loop_to_header_edge (struct loop *, edge,
>  						  int);
>  extern bool can_unroll_loop_p (struct loop *loop, unsigned factor,
>  			       struct tree_niter_desc *niter);
> +extern gcov_type niter_for_unrolled_loop (struct loop *, unsigned);
>  extern void tree_transform_and_unroll_loop (struct loop *, unsigned,
>  					    edge, struct tree_niter_desc *,
>  					    transform_callback, void *);
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index c5a1627..6bbf816 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -6718,6 +6718,50 @@ loop_niters_no_overflow (loop_vec_info loop_vinfo)
>    return false;
>  }
>  
> +/* Scale profiling counters by estimation for LOOP which is vectorized
> +   by factor VF.  */
> +
> +static void
> +scale_profile_for_vect_loop (struct loop *loop, unsigned vf)
> +{
> +  edge preheader = loop_preheader_edge (loop);
> +  /* Reduce loop iterations by the vectorization factor.  */
> +  gcov_type new_est_niter = niter_for_unrolled_loop (loop, vf);
> +  gcov_type freq_h = loop->header->count, freq_e = preheader->count;
> +
> +  /* Use frequency only if counts are zero.  */
> +  if (freq_h == 0 && freq_e == 0)
> +    {
> +      freq_h = loop->header->frequency;
> +      freq_e = EDGE_FREQUENCY (preheader);
> +    }
> +  if (freq_h != 0)
> +    {
> +      gcov_type scale;
> +
> +      /* Avoid dropping loop body profile counter to 0 because of zero count
> +	 in loop's preheader.  */
> +      freq_e = MAX (freq_e, 1);
> +      /* This should not overflow.  */
> +      scale = GCOV_COMPUTE_SCALE (freq_e * (new_est_niter + 1), freq_h);
> +      scale_loop_frequencies (loop, scale, REG_BR_PROB_BASE);
> +    }
> +
> +  basic_block exit_bb = single_pred (loop->latch);
> +  edge exit_e = single_exit (loop);
> +  exit_e->count = loop_preheader_edge (loop)->count;
> +  exit_e->probability = REG_BR_PROB_BASE / (new_est_niter + 1);
> +
> +  edge exit_l = single_pred_edge (loop->latch);
> +  int prob = exit_l->probability;
> +  exit_l->probability = REG_BR_PROB_BASE - exit_e->probability;
> +  exit_l->count = exit_bb->count - exit_e->count;
> +  if (exit_l->count < 0)
> +    exit_l->count = 0;
> +  if (prob > 0)
> +    scale_bbs_frequencies_int (&loop->latch, 1, exit_l->probability, prob);
> +}
> +
>  /* Function vect_transform_loop.
>  
>     The analysis phase has determined that the loop is vectorizable.
> @@ -6743,16 +6787,10 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>    bool transform_pattern_stmt = false;
>    bool check_profitability = false;
>    int th;
> -  /* Record number of iterations before we started tampering with the profile. */
> -  gcov_type expected_iterations = expected_loop_iterations_unbounded (loop);
>  
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location, "=== vec_transform_loop ===\n");
>  
> -  /* If profile is inprecise, we have chance to fix it up.  */
> -  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> -    expected_iterations = LOOP_VINFO_INT_NITERS (loop_vinfo);
> -
>    /* Use the more conservative vectorization threshold.  If the number
>       of iterations is constant assume the cost check has been performed
>       by our caller.  If the threshold makes all loops profitable that
> @@ -7068,9 +7106,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>  
>    slpeel_make_loop_iterate_ntimes (loop, niters_vector);
>  
> -  /* Reduce loop iterations by the vectorization factor.  */
> -  scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
> -		      expected_iterations / vf);
> +  scale_profile_for_vect_loop (loop, vf);
> +
>    /* The minimum number of iterations performed by the epilogue.  This
>       is 1 when peeling for gaps because we always need a final scalar
>       iteration.  */

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-02-22 12:30 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-16 18:38 [PATCH PR77536]Generate correct profiling information for vectorized loop Bin Cheng
2017-02-17  1:39 ` Pat Haugen
2017-02-20 12:54 ` Richard Biener
2017-02-20 14:21   ` Jan Hubicka
2017-02-20 15:16     ` Bin.Cheng
2017-02-20 15:44       ` Jan Hubicka
2017-02-20 16:05         ` Bin.Cheng
2017-02-20 17:02           ` Jan Hubicka
2017-02-20 17:53             ` Bin.Cheng
2017-02-21 14:48             ` Bin.Cheng
2017-02-21 15:52               ` Jan Hubicka
2017-02-22 12:23                 ` Bin.Cheng
2017-02-22 14:59                   ` Jan Hubicka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).