public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH PR79347]Maintain profile counter information in vect_do_peeling
@ 2017-02-14 13:09 Bin Cheng
  2017-02-14 14:13 ` Jan Hubicka
  0 siblings, 1 reply; 6+ messages in thread
From: Bin Cheng @ 2017-02-14 13:09 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd

[-- Attachment #1: Type: text/plain, Size: 2344 bytes --]

Hi,
This patch fixes issue reported by PR79347 by calculating/maintaining profile counter information
on the fly in vect_do_peeling.  Due to the order that we first peel prologue loop, peel epilogue loop,
and then add guarding edge skipping prolog+vector loop if niter is small, this patch takes a trick
that firstly scales down counters for loop before peeling and scales counters back after adding the
aforementioned guarding edge.  Otherwise, more work would be needed to calculate counters for
prolog and vector loop. After this patch, # of profile counter for tramp3d benchmark is improved from:

tramp3d-v4.cpp.157t.ifcvt:296
tramp3d-v4.cpp.158t.vect:1118
tramp3d-v4.cpp.159t.dce6:1118
tramp3d-v4.cpp.160t.pcom:1118
tramp3d-v4.cpp.161t.cunroll:1019
tramp3d-v4.cpp.162t.slp1:1019
tramp3d-v4.cpp.164t.ivopts:1019
tramp3d-v4.cpp.165t.lim4:1019
tramp3d-v4.cpp.166t.loopdone:1007
tramp3d-v4.cpp.167t.no_loop:31
...
tramp3d-v4.cpp.226t.optimized:1009

to:

tramp3d-v4.cpp.157t.ifcvt:296
tramp3d-v4.cpp.158t.vect:814
tramp3d-v4.cpp.159t.dce6:814
tramp3d-v4.cpp.160t.pcom:814
tramp3d-v4.cpp.161t.cunroll:723
tramp3d-v4.cpp.162t.slp1:723
tramp3d-v4.cpp.164t.ivopts:723
tramp3d-v4.cpp.165t.lim4:723
tramp3d-v4.cpp.166t.loopdone:711
tramp3d-v4.cpp.167t.no_loop:31
...
tramp3d-v4.cpp.226t.optimized:831

Bootstrap and test on x86_64 and AArch64.  Is it OK?

BTW, with the patch, vectorizer only introduces mismatches by below code in vect_transform_loop:

  /* Reduce loop iterations by the vectorization factor.  */
  scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vf),
		      expected_iterations / vf);

Though it makes sense to scale down according to vect-factor, but it definitely introduces
mismatch between vector_loop's frequency and the rest program.  I also believe it is not
that useful to scale here, especially without profiling information.  At least we need to make
vector_loop's frequency consistent with the rest program.

Thanks,
bin
2017-02-13  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/79347
	* tree-vect-loop-manip.c (apply_probability_for_bb): New function.
	(vect_do_peeling): Maintain profile counters during peeling.

gcc/testsuite/ChangeLog
2017-02-13  Bin Cheng  <bin.cheng@arm.com>

	PR tree-optimization/79347
	* gcc.dg/vect/pr79347.c: New test.

[-- Attachment #2: pr79347-20170209.txt --]
[-- Type: text/plain, Size: 4683 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
new file mode 100644
index 0000000..586c638
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-fdump-tree-vect-all" } */
+
+short *a;
+int c;
+void n(void)
+{
+  for (int i = 0; i<c;i++)
+    a[i]++;
+}
+
+/* { dg-final { scan-tree-dump-times "Invalid sum of " 2 "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index f29449c..e6c481c 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1562,6 +1562,17 @@ slpeel_update_phi_nodes_for_lcssa (struct loop *epilog)
     rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
 }
 
+/* Apply probability PROB to basic block BB and its single succ edge.  */
+
+static void
+apply_probability_for_bb (basic_block bb, int prob)
+{
+  bb->frequency = apply_probability (bb->frequency, prob);
+  bb->count = apply_probability (bb->count, prob);
+  gcc_assert (single_succ_p (bb));
+  single_succ_edge (bb)->count = bb->count;
+}
+
 /* Function vect_do_peeling.
 
    Input:
@@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
      may be preferred.  */
   basic_block anchor = loop_preheader_edge (loop)->src;
   if (skip_vector)
-    split_edge (loop_preheader_edge (loop));
+    {
+      split_edge (loop_preheader_edge (loop));
+
+      /* Due to the order in which we peel prolog and epilog, we first
+	 propagate probability to the whole loop.  The purpose is to
+	 avoid adjusting probabilities of both prolog and vector loops
+	 separately.  Note in this case, the probability of epilog loop
+	 needs to be scaled back later.  */
+      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
+      apply_probability_for_bb (bb_before_loop, prob_vector);
+      scale_loop_profile (loop, prob_vector, bound);
+    }
 
   tree niters_prolog = build_int_cst (type, 0);
   source_location loop_loc = find_loop_location (loop);
@@ -1727,6 +1749,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
 				    niters_prolog, build_int_cst (type, 0));
 	  guard_bb = loop_preheader_edge (prolog)->src;
+	  basic_block bb_after_prolog = loop_preheader_edge (loop)->src;
 	  guard_to = split_edge (loop_preheader_edge (loop));
 	  guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
 					   guard_to, guard_bb,
@@ -1734,6 +1757,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  e = EDGE_PRED (guard_to, 0);
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
 	  slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e);
+
+	  apply_probability_for_bb (bb_after_prolog, prob_prolog);
 	  scale_loop_profile (prolog, prob_prolog, bound_prolog);
 	}
       /* Update init address of DRs.  */
@@ -1796,9 +1821,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  e = EDGE_PRED (guard_to, 0);
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
 	  slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e);
-	  scale_loop_profile (epilog, prob_vector, bound_scalar);
+
+	  /* Simply propagate profile info from guard_bb to guard_to which is
+	     a merge point of control flow.  */
+	  guard_to->frequency = guard_bb->frequency;
+	  guard_to->count = guard_bb->count;
+	  single_succ_edge (guard_to)->count = guard_to->count;
+	  /* Scale probability of epilog loop back.  */
+	  int scale_up = REG_BR_PROB_BASE * REG_BR_PROB_BASE / prob_vector;
+	  scale_loop_frequencies (epilog, scale_up, REG_BR_PROB_BASE);
 	}
 
+      basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
       tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
 	 orig_niters - prolog_peeling, it won't overflow even the orig_niters
@@ -1826,6 +1860,15 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 					   inverse_probability (prob_epilog));
 	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
 					      single_exit (epilog));
+	  /* Only need to handle basic block before epilog loop if it's not
+	     the guard_bb, which is the case when skip_vector is true.  */
+	  if (guard_bb != bb_before_epilog)
+	    {
+	      prob_epilog = (combine_probabilities (prob_vector, prob_epilog)
+			     + inverse_probability (prob_vector));
+
+	      apply_probability_for_bb (bb_before_epilog, prob_epilog);
+	    }
 	  scale_loop_profile (epilog, prob_epilog, bound);
 	}
       else

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling
  2017-02-14 13:09 [PATCH PR79347]Maintain profile counter information in vect_do_peeling Bin Cheng
@ 2017-02-14 14:13 ` Jan Hubicka
  2017-02-14 14:19   ` Bin.Cheng
  2017-02-14 21:41   ` Pat Haugen
  0 siblings, 2 replies; 6+ messages in thread
From: Jan Hubicka @ 2017-02-14 14:13 UTC (permalink / raw)
  To: Bin Cheng; +Cc: gcc-patches, nd

> Thanks,
> bin
> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
> 
> 	PR tree-optimization/79347
> 	* tree-vect-loop-manip.c (apply_probability_for_bb): New function.
> 	(vect_do_peeling): Maintain profile counters during peeling.
> 
> gcc/testsuite/ChangeLog
> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
> 
> 	PR tree-optimization/79347
> 	* gcc.dg/vect/pr79347.c: New test.

> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
> new file mode 100644
> index 0000000..586c638
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-additional-options "-fdump-tree-vect-all" } */
> +
> +short *a;
> +int c;
> +void n(void)
> +{
> +  for (int i = 0; i<c;i++)
> +    a[i]++;
> +}

Thanks for fixing the prologue.  I think there is still one extra problem in the vectorizer.
With the internal vectorized loop I now see:

;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
;;   Invalid sum of incoming frequencies 1961, should be 956
;;    prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
;;    pred:       10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                8 [100.0%]  (FALLTHRU)
  # i_18 = PHI <i_14(10), i_38(8)>
  # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)>
  # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)>
  # ivtmp_78 = PHI <ivtmp_79(10), 0(8)>
  _2 = (long unsigned int) i_18;
  _3 = _2 * 2;
  _4 = a.0_1 + _3;
  vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
  _5 = *_4;
  vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68);
  _6 = (unsigned short) _5;
  vect__7.17_71 = vect__6.16_69 + vect_cst__70;
  _7 = _6 + 1;
  vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71);
  _8 = (short int) _7;
  MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
  i_14 = i_18 + 1;
  vectp_a.13_67 = vectp_a.13_66 + 16;
  vectp_a.19_76 = vectp_a.19_75 + 16;
  ivtmp_79 = ivtmp_78 + 1;
  if (ivtmp_79 < bnd.10_59)
    goto <bb 10>; [85.00%]
  else
    goto <bb 11>; [15.00%]

So it seems that the frequency of the loop itself is unrealistically scaled down.
Before vetorizing the frequency is 8500 and predicted number of iterations is
6.6.  Now the loop is intereed via BB 8 with frequency 1148, so the loop, by
exit probability exits with 15% probability and thus still has 6.6 iterations,
but by BB frequencies its body executes fewer times than the preheader.

Now this is a fragile area vectirizing loop should scale number of iterations down
8 times. However guessed CFG profiles are always very "flat". Of course
if loop iterated 6.6 times at the average vectorizing would not make any sense.
Making guessed profiles less flat is unrealistic, because average loop iterates few times,
but of course while vectorizing we make additional guess that the vectorizable loops
matters and the guessed profile is probably unrealistic.

GCC 6 seems however bit more consistent.
> +/* Apply probability PROB to basic block BB and its single succ edge.  */
> +
> +static void
> +apply_probability_for_bb (basic_block bb, int prob)
> +{
> +  bb->frequency = apply_probability (bb->frequency, prob);
> +  bb->count = apply_probability (bb->count, prob);
> +  gcc_assert (single_succ_p (bb));
> +  single_succ_edge (bb)->count = bb->count;
> +}
> +
>  /* Function vect_do_peeling.
>  
>     Input:
> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>       may be preferred.  */
>    basic_block anchor = loop_preheader_edge (loop)->src;
>    if (skip_vector)
> -    split_edge (loop_preheader_edge (loop));
> +    {
> +      split_edge (loop_preheader_edge (loop));
> +
> +      /* Due to the order in which we peel prolog and epilog, we first
> +	 propagate probability to the whole loop.  The purpose is to
> +	 avoid adjusting probabilities of both prolog and vector loops
> +	 separately.  Note in this case, the probability of epilog loop
> +	 needs to be scaled back later.  */
> +      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
> +      apply_probability_for_bb (bb_before_loop, prob_vector);
Aha, this is the bit I missed while trying to fix it myself.
I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, REG_BR_PROB_BASE)
to do this.  I plan to revamp API for this next stage1, but lets keep this consistent.
Path is OK with this change and ...
> +      scale_loop_profile (loop, prob_vector, bound);
... please try to check if scaling is really done reasonably.  From the above
it seems that the vectorized loop is unrealistically scalled down that may prevent
further optimization for speed...

Thanks for looking into this,
Honza

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling
  2017-02-14 14:13 ` Jan Hubicka
@ 2017-02-14 14:19   ` Bin.Cheng
  2017-02-15 15:30     ` Bin.Cheng
  2017-02-14 21:41   ` Pat Haugen
  1 sibling, 1 reply; 6+ messages in thread
From: Bin.Cheng @ 2017-02-14 14:19 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches

On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Thanks,
>> bin
>> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
>>
>>       PR tree-optimization/79347
>>       * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
>>       (vect_do_peeling): Maintain profile counters during peeling.
>>
>> gcc/testsuite/ChangeLog
>> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
>>
>>       PR tree-optimization/79347
>>       * gcc.dg/vect/pr79347.c: New test.
>
>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
>> new file mode 100644
>> index 0000000..586c638
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
>> @@ -0,0 +1,13 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target vect_int } */
>> +/* { dg-additional-options "-fdump-tree-vect-all" } */
>> +
>> +short *a;
>> +int c;
>> +void n(void)
>> +{
>> +  for (int i = 0; i<c;i++)
>> +    a[i]++;
>> +}
>
> Thanks for fixing the prologue.  I think there is still one extra problem in the vectorizer.
> With the internal vectorized loop I now see:
>
> ;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
> ;;   Invalid sum of incoming frequencies 1961, should be 956
> ;;    prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;                8 [100.0%]  (FALLTHRU)
>   # i_18 = PHI <i_14(10), i_38(8)>
>   # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)>
>   # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)>
>   # ivtmp_78 = PHI <ivtmp_79(10), 0(8)>
>   _2 = (long unsigned int) i_18;
>   _3 = _2 * 2;
>   _4 = a.0_1 + _3;
>   vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
>   _5 = *_4;
>   vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68);
>   _6 = (unsigned short) _5;
>   vect__7.17_71 = vect__6.16_69 + vect_cst__70;
>   _7 = _6 + 1;
>   vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71);
>   _8 = (short int) _7;
>   MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
>   i_14 = i_18 + 1;
>   vectp_a.13_67 = vectp_a.13_66 + 16;
>   vectp_a.19_76 = vectp_a.19_75 + 16;
>   ivtmp_79 = ivtmp_78 + 1;
>   if (ivtmp_79 < bnd.10_59)
>     goto <bb 10>; [85.00%]
>   else
>     goto <bb 11>; [15.00%]
>
> So it seems that the frequency of the loop itself is unrealistically scaled down.
> Before vetorizing the frequency is 8500 and predicted number of iterations is
> 6.6.  Now the loop is intereed via BB 8 with frequency 1148, so the loop, by
> exit probability exits with 15% probability and thus still has 6.6 iterations,
> but by BB frequencies its body executes fewer times than the preheader.
>
> Now this is a fragile area vectirizing loop should scale number of iterations down
> 8 times. However guessed CFG profiles are always very "flat". Of course
> if loop iterated 6.6 times at the average vectorizing would not make any sense.
> Making guessed profiles less flat is unrealistic, because average loop iterates few times,
> but of course while vectorizing we make additional guess that the vectorizable loops
> matters and the guessed profile is probably unrealistic.
That's what I mentioned in the original patch.  Vectorizer calls
scale_loop_profile in
function vect_transform_loop to scale down loop's frequency regardless mismatch
between loop and preheader/exit basic blocks.  In fact, after this
patch all mismatches
in vectorizer are introduced by this.  I don't see any way to keep
consistency beween
vectorized loop and the rest program without visiting whole CFG.  So
shall we skip
scaling down profile counters for vectorized loop?

>
> GCC 6 seems however bit more consistent.
>> +/* Apply probability PROB to basic block BB and its single succ edge.  */
>> +
>> +static void
>> +apply_probability_for_bb (basic_block bb, int prob)
>> +{
>> +  bb->frequency = apply_probability (bb->frequency, prob);
>> +  bb->count = apply_probability (bb->count, prob);
>> +  gcc_assert (single_succ_p (bb));
>> +  single_succ_edge (bb)->count = bb->count;
>> +}
>> +
>>  /* Function vect_do_peeling.
>>
>>     Input:
>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>       may be preferred.  */
>>    basic_block anchor = loop_preheader_edge (loop)->src;
>>    if (skip_vector)
>> -    split_edge (loop_preheader_edge (loop));
>> +    {
>> +      split_edge (loop_preheader_edge (loop));
>> +
>> +      /* Due to the order in which we peel prolog and epilog, we first
>> +      propagate probability to the whole loop.  The purpose is to
>> +      avoid adjusting probabilities of both prolog and vector loops
>> +      separately.  Note in this case, the probability of epilog loop
>> +      needs to be scaled back later.  */
>> +      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
>> +      apply_probability_for_bb (bb_before_loop, prob_vector);
> Aha, this is the bit I missed while trying to fix it myself.
> I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, REG_BR_PROB_BASE)
> to do this.  I plan to revamp API for this next stage1, but lets keep this consistent.
Will change that.

> Path is OK with this change and ...
>> +      scale_loop_profile (loop, prob_vector, bound);
Can't do this with current interface because I need to scale up
frequency after skip_vector,
and sub-calls in scale_loop_profile checks validity for prob and only
allows scaling down...

Thanks,
bin
> ... please try to check if scaling is really done reasonably.  From the above
> it seems that the vectorized loop is unrealistically scalled down that may prevent
> further optimization for speed...
>
> Thanks for looking into this,
> Honza

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling
  2017-02-14 14:13 ` Jan Hubicka
  2017-02-14 14:19   ` Bin.Cheng
@ 2017-02-14 21:41   ` Pat Haugen
  1 sibling, 0 replies; 6+ messages in thread
From: Pat Haugen @ 2017-02-14 21:41 UTC (permalink / raw)
  To: Jan Hubicka, Bin Cheng; +Cc: gcc-patches, nd

On 02/14/2017 07:57 AM, Jan Hubicka wrote:
> So it seems that the frequency of the loop itself is unrealistically scaled down.
> Before vetorizing the frequency is 8500 and predicted number of iterations is
> 6.6.  Now the loop is intereed via BB 8 with frequency 1148, so the loop, by
> exit probability exits with 15% probability and thus still has 6.6 iterations,
> but by BB frequencies its body executes fewer times than the preheader.
> 
> Now this is a fragile area vectirizing loop should scale number of iterations down
> 8 times. However guessed CFG profiles are always very "flat". Of course
> if loop iterated 6.6 times at the average vectorizing would not make any sense.
> Making guessed profiles less flat is unrealistic, because average loop iterates few times,
> but of course while vectorizing we make additional guess that the vectorizable loops
> matters and the guessed profile is probably unrealistic.

We have the same problem in the RTL loop unroller in that we'll scale the unrolled loop by the unroll factor (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68212#c3), which can result in a loop with lower frequency than surrounding code. Problem is compounded if we vectorize the loop and then unroll it. Whatever approach is decided for the case when we have guessed profile should be applied to both vectorizer and RTL loop unroller.

-Pat

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling
  2017-02-14 14:19   ` Bin.Cheng
@ 2017-02-15 15:30     ` Bin.Cheng
  2017-02-15 17:07       ` Jan Hubicka
  0 siblings, 1 reply; 6+ messages in thread
From: Bin.Cheng @ 2017-02-15 15:30 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 6345 bytes --]

On Tue, Feb 14, 2017 at 2:13 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> Thanks,
>>> bin
>>> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>       PR tree-optimization/79347
>>>       * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
>>>       (vect_do_peeling): Maintain profile counters during peeling.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>       PR tree-optimization/79347
>>>       * gcc.dg/vect/pr79347.c: New test.
>>
>>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
>>> new file mode 100644
>>> index 0000000..586c638
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
>>> @@ -0,0 +1,13 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-require-effective-target vect_int } */
>>> +/* { dg-additional-options "-fdump-tree-vect-all" } */
>>> +
>>> +short *a;
>>> +int c;
>>> +void n(void)
>>> +{
>>> +  for (int i = 0; i<c;i++)
>>> +    a[i]++;
>>> +}
>>
>> Thanks for fixing the prologue.  I think there is still one extra problem in the vectorizer.
>> With the internal vectorized loop I now see:
>>
>> ;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
>> ;;   Invalid sum of incoming frequencies 1961, should be 956
>> ;;    prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
>> ;;    pred:       10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
>> ;;                8 [100.0%]  (FALLTHRU)
>>   # i_18 = PHI <i_14(10), i_38(8)>
>>   # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)>
>>   # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)>
>>   # ivtmp_78 = PHI <ivtmp_79(10), 0(8)>
>>   _2 = (long unsigned int) i_18;
>>   _3 = _2 * 2;
>>   _4 = a.0_1 + _3;
>>   vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
>>   _5 = *_4;
>>   vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68);
>>   _6 = (unsigned short) _5;
>>   vect__7.17_71 = vect__6.16_69 + vect_cst__70;
>>   _7 = _6 + 1;
>>   vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71);
>>   _8 = (short int) _7;
>>   MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
>>   i_14 = i_18 + 1;
>>   vectp_a.13_67 = vectp_a.13_66 + 16;
>>   vectp_a.19_76 = vectp_a.19_75 + 16;
>>   ivtmp_79 = ivtmp_78 + 1;
>>   if (ivtmp_79 < bnd.10_59)
>>     goto <bb 10>; [85.00%]
>>   else
>>     goto <bb 11>; [15.00%]
>>
>> So it seems that the frequency of the loop itself is unrealistically scaled down.
>> Before vetorizing the frequency is 8500 and predicted number of iterations is
>> 6.6.  Now the loop is intereed via BB 8 with frequency 1148, so the loop, by
>> exit probability exits with 15% probability and thus still has 6.6 iterations,
>> but by BB frequencies its body executes fewer times than the preheader.
>>
>> Now this is a fragile area vectirizing loop should scale number of iterations down
>> 8 times. However guessed CFG profiles are always very "flat". Of course
>> if loop iterated 6.6 times at the average vectorizing would not make any sense.
>> Making guessed profiles less flat is unrealistic, because average loop iterates few times,
>> but of course while vectorizing we make additional guess that the vectorizable loops
>> matters and the guessed profile is probably unrealistic.
> That's what I mentioned in the original patch.  Vectorizer calls
> scale_loop_profile in
> function vect_transform_loop to scale down loop's frequency regardless mismatch
> between loop and preheader/exit basic blocks.  In fact, after this
> patch all mismatches
> in vectorizer are introduced by this.  I don't see any way to keep
> consistency beween
> vectorized loop and the rest program without visiting whole CFG.  So
> shall we skip
> scaling down profile counters for vectorized loop?
>
>>
>> GCC 6 seems however bit more consistent.
>>> +/* Apply probability PROB to basic block BB and its single succ edge.  */
>>> +
>>> +static void
>>> +apply_probability_for_bb (basic_block bb, int prob)
>>> +{
>>> +  bb->frequency = apply_probability (bb->frequency, prob);
>>> +  bb->count = apply_probability (bb->count, prob);
>>> +  gcc_assert (single_succ_p (bb));
>>> +  single_succ_edge (bb)->count = bb->count;
>>> +}
>>> +
>>>  /* Function vect_do_peeling.
>>>
>>>     Input:
>>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>>       may be preferred.  */
>>>    basic_block anchor = loop_preheader_edge (loop)->src;
>>>    if (skip_vector)
>>> -    split_edge (loop_preheader_edge (loop));
>>> +    {
>>> +      split_edge (loop_preheader_edge (loop));
>>> +
>>> +      /* Due to the order in which we peel prolog and epilog, we first
>>> +      propagate probability to the whole loop.  The purpose is to
>>> +      avoid adjusting probabilities of both prolog and vector loops
>>> +      separately.  Note in this case, the probability of epilog loop
>>> +      needs to be scaled back later.  */
>>> +      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
>>> +      apply_probability_for_bb (bb_before_loop, prob_vector);
>> Aha, this is the bit I missed while trying to fix it myself.
>> I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, REG_BR_PROB_BASE)
>> to do this.  I plan to revamp API for this next stage1, but lets keep this consistent.
> Will change that.
>
>> Path is OK with this change and ...
>>> +      scale_loop_profile (loop, prob_vector, bound);
> Can't do this with current interface because I need to scale up
> frequency after skip_vector,
> and sub-calls in scale_loop_profile checks validity for prob and only
> allows scaling down...

Hi,
Attachment is the updated patch.  Bootstrap and tested.  Is it OK?

Thanks,
bin

2017-02-13  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/79347
    * tree-vect-loop-manip.c (vect_do_peeling): Maintain profile
    counters during peeling.

gcc/testsuite/ChangeLog
2017-02-13  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/79347
    * gcc.dg/vect/pr79347.c: New test.

>
> Thanks,
> bin
>> ... please try to check if scaling is really done reasonably.  From the above
>> it seems that the vectorized loop is unrealistically scalled down that may prevent
>> further optimization for speed...
>>
>> Thanks for looking into this,
>> Honza

[-- Attachment #2: pr79347-20170210.txt --]
[-- Type: text/plain, Size: 4240 bytes --]

diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
new file mode 100644
index 0000000..586c638
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-fdump-tree-vect-all" } */
+
+short *a;
+int c;
+void n(void)
+{
+  for (int i = 0; i<c;i++)
+    a[i]++;
+}
+
+/* { dg-final { scan-tree-dump-times "Invalid sum of " 2 "vect" } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index f29449c..5ee2c38 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1690,7 +1690,19 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
      may be preferred.  */
   basic_block anchor = loop_preheader_edge (loop)->src;
   if (skip_vector)
-    split_edge (loop_preheader_edge (loop));
+    {
+      split_edge (loop_preheader_edge (loop));
+
+      /* Due to the order in which we peel prolog and epilog, we first
+	 propagate probability to the whole loop.  The purpose is to
+	 avoid adjusting probabilities of both prolog and vector loops
+	 separately.  Note in this case, the probability of epilog loop
+	 needs to be scaled back later.  */
+      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
+      scale_bbs_frequencies_int (&bb_before_loop, 1, prob_vector,
+				 REG_BR_PROB_BASE);
+      scale_loop_profile (loop, prob_vector, bound);
+    }
 
   tree niters_prolog = build_int_cst (type, 0);
   source_location loop_loc = find_loop_location (loop);
@@ -1727,6 +1739,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
 				    niters_prolog, build_int_cst (type, 0));
 	  guard_bb = loop_preheader_edge (prolog)->src;
+	  basic_block bb_after_prolog = loop_preheader_edge (loop)->src;
 	  guard_to = split_edge (loop_preheader_edge (loop));
 	  guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
 					   guard_to, guard_bb,
@@ -1734,6 +1747,9 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  e = EDGE_PRED (guard_to, 0);
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
 	  slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e);
+
+	  scale_bbs_frequencies_int (&bb_after_prolog, 1, prob_prolog,
+				     REG_BR_PROB_BASE);
 	  scale_loop_profile (prolog, prob_prolog, bound_prolog);
 	}
       /* Update init address of DRs.  */
@@ -1796,9 +1812,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  e = EDGE_PRED (guard_to, 0);
 	  e = (e != guard_e ? e : EDGE_PRED (guard_to, 1));
 	  slpeel_update_phi_nodes_for_guard1 (first_loop, epilog, guard_e, e);
-	  scale_loop_profile (epilog, prob_vector, bound_scalar);
+
+	  /* Simply propagate profile info from guard_bb to guard_to which is
+	     a merge point of control flow.  */
+	  guard_to->frequency = guard_bb->frequency;
+	  guard_to->count = guard_bb->count;
+	  single_succ_edge (guard_to)->count = guard_to->count;
+	  /* Scale probability of epilog loop back.  */
+	  int scale_up = REG_BR_PROB_BASE * REG_BR_PROB_BASE / prob_vector;
+	  scale_loop_frequencies (epilog, scale_up, REG_BR_PROB_BASE);
 	}
 
+      basic_block bb_before_epilog = loop_preheader_edge (epilog)->src;
       tree niters_vector_mult_vf;
       /* If loop is peeled for non-zero constant times, now niters refers to
 	 orig_niters - prolog_peeling, it won't overflow even the orig_niters
@@ -1826,6 +1851,16 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 					   inverse_probability (prob_epilog));
 	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
 					      single_exit (epilog));
+	  /* Only need to handle basic block before epilog loop if it's not
+	     the guard_bb, which is the case when skip_vector is true.  */
+	  if (guard_bb != bb_before_epilog)
+	    {
+	      prob_epilog = (combine_probabilities (prob_vector, prob_epilog)
+			     + inverse_probability (prob_vector));
+
+	      scale_bbs_frequencies_int (&bb_before_epilog, 1, prob_epilog,
+					 REG_BR_PROB_BASE);
+	    }
 	  scale_loop_profile (epilog, prob_epilog, bound);
 	}
       else

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH PR79347]Maintain profile counter information in vect_do_peeling
  2017-02-15 15:30     ` Bin.Cheng
@ 2017-02-15 17:07       ` Jan Hubicka
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Hubicka @ 2017-02-15 17:07 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Jan Hubicka, gcc-patches

> On Tue, Feb 14, 2017 at 2:13 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > On Tue, Feb 14, 2017 at 1:57 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>> Thanks,
> >>> bin
> >>> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
> >>>
> >>>       PR tree-optimization/79347
> >>>       * tree-vect-loop-manip.c (apply_probability_for_bb): New function.
> >>>       (vect_do_peeling): Maintain profile counters during peeling.
> >>>
> >>> gcc/testsuite/ChangeLog
> >>> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
> >>>
> >>>       PR tree-optimization/79347
> >>>       * gcc.dg/vect/pr79347.c: New test.
> >>
> >>> diff --git a/gcc/testsuite/gcc.dg/vect/pr79347.c b/gcc/testsuite/gcc.dg/vect/pr79347.c
> >>> new file mode 100644
> >>> index 0000000..586c638
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.dg/vect/pr79347.c
> >>> @@ -0,0 +1,13 @@
> >>> +/* { dg-do compile } */
> >>> +/* { dg-require-effective-target vect_int } */
> >>> +/* { dg-additional-options "-fdump-tree-vect-all" } */
> >>> +
> >>> +short *a;
> >>> +int c;
> >>> +void n(void)
> >>> +{
> >>> +  for (int i = 0; i<c;i++)
> >>> +    a[i]++;
> >>> +}
> >>
> >> Thanks for fixing the prologue.  I think there is still one extra problem in the vectorizer.
> >> With the internal vectorized loop I now see:
> >>
> >> ;;   basic block 9, loop depth 1, count 0, freq 956, maybe hot
> >> ;;   Invalid sum of incoming frequencies 1961, should be 956
> >> ;;    prev block 8, next block 10, flags: (NEW, REACHABLE, VISITED)
> >> ;;    pred:       10 [100.0%]  (FALLTHRU,DFS_BACK,EXECUTABLE)
> >> ;;                8 [100.0%]  (FALLTHRU)
> >>   # i_18 = PHI <i_14(10), i_38(8)>
> >>   # vectp_a.13_66 = PHI <vectp_a.13_67(10), vectp_a.14_64(8)>
> >>   # vectp_a.19_75 = PHI <vectp_a.19_76(10), vectp_a.20_73(8)>
> >>   # ivtmp_78 = PHI <ivtmp_79(10), 0(8)>
> >>   _2 = (long unsigned int) i_18;
> >>   _3 = _2 * 2;
> >>   _4 = a.0_1 + _3;
> >>   vect__5.15_68 = MEM[(short int *)vectp_a.13_66];
> >>   _5 = *_4;
> >>   vect__6.16_69 = VIEW_CONVERT_EXPR<vector(8) unsigned short>(vect__5.15_68);
> >>   _6 = (unsigned short) _5;
> >>   vect__7.17_71 = vect__6.16_69 + vect_cst__70;
> >>   _7 = _6 + 1;
> >>   vect__8.18_72 = VIEW_CONVERT_EXPR<vector(8) short int>(vect__7.17_71);
> >>   _8 = (short int) _7;
> >>   MEM[(short int *)vectp_a.19_75] = vect__8.18_72;
> >>   i_14 = i_18 + 1;
> >>   vectp_a.13_67 = vectp_a.13_66 + 16;
> >>   vectp_a.19_76 = vectp_a.19_75 + 16;
> >>   ivtmp_79 = ivtmp_78 + 1;
> >>   if (ivtmp_79 < bnd.10_59)
> >>     goto <bb 10>; [85.00%]
> >>   else
> >>     goto <bb 11>; [15.00%]
> >>
> >> So it seems that the frequency of the loop itself is unrealistically scaled down.
> >> Before vetorizing the frequency is 8500 and predicted number of iterations is
> >> 6.6.  Now the loop is intereed via BB 8 with frequency 1148, so the loop, by
> >> exit probability exits with 15% probability and thus still has 6.6 iterations,
> >> but by BB frequencies its body executes fewer times than the preheader.
> >>
> >> Now this is a fragile area vectirizing loop should scale number of iterations down
> >> 8 times. However guessed CFG profiles are always very "flat". Of course
> >> if loop iterated 6.6 times at the average vectorizing would not make any sense.
> >> Making guessed profiles less flat is unrealistic, because average loop iterates few times,
> >> but of course while vectorizing we make additional guess that the vectorizable loops
> >> matters and the guessed profile is probably unrealistic.
> > That's what I mentioned in the original patch.  Vectorizer calls
> > scale_loop_profile in
> > function vect_transform_loop to scale down loop's frequency regardless mismatch
> > between loop and preheader/exit basic blocks.  In fact, after this
> > patch all mismatches
> > in vectorizer are introduced by this.  I don't see any way to keep
> > consistency beween
> > vectorized loop and the rest program without visiting whole CFG.  So
> > shall we skip
> > scaling down profile counters for vectorized loop?
> >
> >>
> >> GCC 6 seems however bit more consistent.
> >>> +/* Apply probability PROB to basic block BB and its single succ edge.  */
> >>> +
> >>> +static void
> >>> +apply_probability_for_bb (basic_block bb, int prob)
> >>> +{
> >>> +  bb->frequency = apply_probability (bb->frequency, prob);
> >>> +  bb->count = apply_probability (bb->count, prob);
> >>> +  gcc_assert (single_succ_p (bb));
> >>> +  single_succ_edge (bb)->count = bb->count;
> >>> +}
> >>> +
> >>>  /* Function vect_do_peeling.
> >>>
> >>>     Input:
> >>> @@ -1690,7 +1701,18 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
> >>>       may be preferred.  */
> >>>    basic_block anchor = loop_preheader_edge (loop)->src;
> >>>    if (skip_vector)
> >>> -    split_edge (loop_preheader_edge (loop));
> >>> +    {
> >>> +      split_edge (loop_preheader_edge (loop));
> >>> +
> >>> +      /* Due to the order in which we peel prolog and epilog, we first
> >>> +      propagate probability to the whole loop.  The purpose is to
> >>> +      avoid adjusting probabilities of both prolog and vector loops
> >>> +      separately.  Note in this case, the probability of epilog loop
> >>> +      needs to be scaled back later.  */
> >>> +      basic_block bb_before_loop = loop_preheader_edge (loop)->src;
> >>> +      apply_probability_for_bb (bb_before_loop, prob_vector);
> >> Aha, this is the bit I missed while trying to fix it myself.
> >> I scale_bbs_frequencies_int(&bb_before_loop, 1, prob_vector, REG_BR_PROB_BASE)
> >> to do this.  I plan to revamp API for this next stage1, but lets keep this consistent.
> > Will change that.
> >
> >> Path is OK with this change and ...
> >>> +      scale_loop_profile (loop, prob_vector, bound);
> > Can't do this with current interface because I need to scale up
> > frequency after skip_vector,
> > and sub-calls in scale_loop_profile checks validity for prob and only
> > allows scaling down...
> 
> Hi,
> Attachment is the updated patch.  Bootstrap and tested.  Is it OK?
> 
> Thanks,
> bin
> 
> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
> 
>     PR tree-optimization/79347
>     * tree-vect-loop-manip.c (vect_do_peeling): Maintain profile
>     counters during peeling.
> 
> gcc/testsuite/ChangeLog
> 2017-02-13  Bin Cheng  <bin.cheng@arm.com>
> 
>     PR tree-optimization/79347
>     * gcc.dg/vect/pr79347.c: New test.

OK, thanks!

Honza

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-02-15 17:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-14 13:09 [PATCH PR79347]Maintain profile counter information in vect_do_peeling Bin Cheng
2017-02-14 14:13 ` Jan Hubicka
2017-02-14 14:19   ` Bin.Cheng
2017-02-15 15:30     ` Bin.Cheng
2017-02-15 17:07       ` Jan Hubicka
2017-02-14 21:41   ` Pat Haugen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).