[PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
@ 2014-12-05 12:15 Bin Cheng
  2014-12-09 22:58 ` Jeff Law
  2014-12-10 13:47 ` Richard Biener
  0 siblings, 2 replies; 14+ messages in thread
From: Bin Cheng @ 2014-12-05 12:15 UTC (permalink / raw)
  To: gcc-patches; +Cc: ook

[-- Attachment #1: Type: text/plain, Size: 2100 bytes --]

Hi,
Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
issue still exists.

Current candidate selecting algorithm tends to select fewer candidates given
below reasons:
  1) to better handle loops with many induction uses but the best choice is
one generic basic induction variable;
  2) to keep compilation time low.

One fundamental weakness of the strategy is the opposite situation can't be
handled properly sometimes.  For these cases the best choice is each
induction variable has its own candidate.
This patch fixes the problem by shuffling candidate set after fix-point is
reached by current implementation.  The reason why this strategy works is it
replaces candidate set by selecting local optimal candidate for some
induction uses, and the new candidate set (has lower cost) is exact what we
want in the mentioned case.  Instrumentation data shows this can find better
candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.

This patch actually is extension to the first version patch posted at
https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
another selecting pass with special seed set (more or less like the shuffled
set in this patch).  Data also confirms this patch can find optimal sets for
most loops found by the first one, as well as optimal sets for many new
loops.

Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
test on aarch64.
Since this patch only selects candidate set with lower cost, any regressions
revealed are latent bugs of other components in GCC.
I also collected GCC bootstrap time on x86_64, no regression either.
Is this OK?

2014-12-03  Bin Cheng  bin.cheng@arm.com

  PR tree-optimization/62178
  * tree-ssa-loop-ivopts.c (iv_ca_replace): New function.
  (try_improve_iv_set): Shuffle candidates set in order to handle
  case in which candidate wrto each iv use should be selected.

gcc/testsuite/ChangeLog
2014-12-03  Bin Cheng  bin.cheng@arm.com

  PR tree-optimization/62178
  * gcc.target/aarch64/pr62178.c: New test.

[-- Attachment #2: pr62178-20141202.txt --]
[-- Type: text/plain, Size: 4850 bytes --]

Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 217828)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -5718,6 +5718,85 @@ iv_ca_extend (struct ivopts_data *data, struct iv_
   return cost;
 }
 
+/* Try replacing candidates in IVS which are recorded by list ACT_DELTA to
+   lower cost candidates.  CAND is the one won't be replaced.  Replacement
+   of candidate is recorded in list DELTA.  */
+
+static void
+iv_ca_replace (struct ivopts_data *data, struct iv_ca *ivs,
+	       struct iv_cand *cand, struct iv_ca_delta *act_delta,
+	       struct iv_ca_delta **delta)
+{
+  unsigned int i, j;
+  bitmap_iterator bi;
+  struct iv_use *use;
+  struct iv_cand *cnd;
+  bool should_replace;
+  struct iv_ca_delta *act;
+  struct cost_pair *old_cp, *best_cp = NULL, *cp;
+
+  *delta = NULL;
+  for (i = 0; i < ivs->upto; i++)
+    {
+      use = iv_use (data, i);
+
+      old_cp = iv_ca_cand_for_use (ivs, use);
+      if (old_cp->cand == cand)
+	continue;
+
+      should_replace = false;
+      for (act = act_delta; act; act = act->next_change)
+	if (old_cp->cand == act->old_cp->cand)
+	  {
+	    should_replace = true;
+	    break;
+	  }
+      if (!should_replace)
+	continue;
+
+      best_cp = NULL;
+      if (data->consider_all_candidates)
+	{
+	  for (j = 0; j < n_iv_cands (data); j++)
+	    {
+	      if (j == old_cp->cand->id)
+		continue;
+
+	      cnd = iv_cand (data, j);
+	      cp = get_use_iv_cost (data, use, cnd);
+	      if (!cp)
+		continue;
+
+	      if (best_cp == NULL || cheaper_cost_pair (cp, best_cp))
+		best_cp = cp;
+	    }
+	}
+      else
+	{
+	  EXECUTE_IF_SET_IN_BITMAP (use->related_cands, 0, j, bi)
+	    {
+	      if (j == old_cp->cand->id)
+		continue;
+
+	      cnd = iv_cand (data, j);
+	      cp = get_use_iv_cost (data, use, cnd);
+	      if (!cp)
+		continue;
+
+	      if (best_cp == NULL || cheaper_cost_pair (cp, best_cp))
+		best_cp = cp;
+	    }
+	}
+
+      if (!best_cp)
+	continue;
+
+      *delta = iv_ca_delta_add (use, old_cp, best_cp, *delta);
+    }
+
+  return;
+}
+
 /* Try narrowing set IVS by removing CAND.  Return the cost of
    the new set and store the differences in DELTA.  START is
    the candidate with which we start narrowing.  */
@@ -6042,8 +6121,50 @@ try_improve_iv_set (struct ivopts_data *data, stru
       /* Try removing the candidates from the set instead.  */
       best_cost = iv_ca_prune (data, ivs, NULL, &best_delta);
 
-      /* Nothing more we can do.  */
       if (!best_delta)
+	{
+	  /* So far candidate selecting algorithm tends to choose fewer IVs
+	     so that it can handle cases in which loops have many variables
+	     but the best choice is often to use only one general biv.  One
+	     weakness is it can't handle opposite cases, in which different
+	     candidates should be chosen with respect to each use.  To solve
+	     the problem, we replace candidate of some uses with lower cost
+	     one, thus general algorithm can have a chance to find optimal
+	     set for these cases.  */
+	  for (i = 0; i < n_iv_cands (data); i++)
+	    {
+	      cand = iv_cand (data, i);
+	      if (iv_ca_cand_used_p (ivs, cand))
+		continue;
+
+	      acost = iv_ca_extend (data, ivs, cand, &act_delta, NULL, false);
+	      if (!act_delta)
+		continue;
+
+	      iv_ca_delta_commit (data, ivs, act_delta, true);
+	      iv_ca_replace (data, ivs, cand, act_delta, &tmp_delta);
+	      if (tmp_delta)
+		{
+		  iv_ca_delta_commit (data, ivs, tmp_delta, true);
+		  act_delta = iv_ca_delta_join (act_delta, tmp_delta);
+		  acost = iv_ca_prune (data, ivs, cand, &tmp_delta);
+		}
+
+	      iv_ca_delta_commit (data, ivs, act_delta, false);
+	      act_delta = iv_ca_delta_join (act_delta, tmp_delta);
+
+	      if (compare_costs (acost, best_cost) < 0)
+		{
+		  best_cost = acost;
+		  iv_ca_delta_free (&best_delta);
+		  best_delta = act_delta;
+		}
+	      else
+		iv_ca_delta_free (&act_delta);
+	    }
+	}
+
+      if (!best_delta)
 	return false;
     }
 
Index: gcc/testsuite/gcc.target/aarch64/pr62178.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/pr62178.c	(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/pr62178.c	(revision 0)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a[30 +1][30 +1], b[30 +1][30 +1], r[30 +1][30 +1];
+
+void foo (void) {
+  int i, j, k;
+
+  for ( i = 1; i <= 30; i++ )
+    for ( j = 1; j <= 30; j++ ) {
+      r[i][j] = 0;
+      for(k = 1; k <= 30; k++ )
+        r[i][j] += a[i][k]*b[k][j];
+    }
+}
+
+/* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\."} } */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-05 12:15 [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try Bin Cheng
@ 2014-12-09 22:58 ` Jeff Law
  2014-12-10  6:10   ` Bin.Cheng
  2014-12-10 13:47 ` Richard Biener
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff Law @ 2014-12-09 22:58 UTC (permalink / raw)
  To: Bin Cheng, gcc-patches; +Cc: ook

On 12/05/14 05:15, Bin Cheng wrote:
> Hi,
> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
> issue still exists.
>
> Current candidate selecting algorithm tends to select fewer candidates given
> below reasons:
>    1) to better handle loops with many induction uses but the best choice is
> one generic basic induction variable;
>    2) to keep compilation time low.
>
> One fundamental weakness of the strategy is the opposite situation can't be
> handled properly sometimes.  For these cases the best choice is each
> induction variable has its own candidate.
> This patch fixes the problem by shuffling candidate set after fix-point is
> reached by current implementation.  The reason why this strategy works is it
> replaces candidate set by selecting local optimal candidate for some
> induction uses, and the new candidate set (has lower cost) is exact what we
> want in the mentioned case.  Instrumentation data shows this can find better
> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>
> This patch actually is extension to the first version patch posted at
> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
> another selecting pass with special seed set (more or less like the shuffled
> set in this patch).  Data also confirms this patch can find optimal sets for
> most loops found by the first one, as well as optimal sets for many new
> loops.
>
> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
> test on aarch64.
> Since this patch only selects candidate set with lower cost, any regressions
> revealed are latent bugs of other components in GCC.
> I also collected GCC bootstrap time on x86_64, no regression either.
> Is this OK?
>
> 2014-12-03  Bin Chengbin.cheng@arm.com
>
>    PR tree-optimization/62178
>    * tree-ssa-loop-ivopts.c (iv_ca_replace): New function.
>    (try_improve_iv_set): Shuffle candidates set in order to handle
>    case in which candidate wrto each iv use should be selected.
>
> gcc/testsuite/ChangeLog
> 2014-12-03  Bin Chengbin.cheng@arm.com
>
>    PR tree-optimization/62178
>    * gcc.target/aarch64/pr62178.c: New test.
>
>
> pr62178-20141202.txt
>
>
> Index: gcc/tree-ssa-loop-ivopts.c
> ===================================================================
> --- gcc/tree-ssa-loop-ivopts.c	(revision 217828)
> +++ gcc/tree-ssa-loop-ivopts.c	(working copy)
> @@ -5718,6 +5718,85 @@ iv_ca_extend (struct ivopts_data *data, struct iv_
>     return cost;
>   }
>
> +/* Try replacing candidates in IVS which are recorded by list ACT_DELTA to
> +   lower cost candidates.  CAND is the one won't be replaced.  Replacement
> +   of candidate is recorded in list DELTA.  */
Is this better written as:

Try replacing candidates in IVS with IVs related to CAND (which is not
changed) if doing so lowers the IV cost.  ACT_DELTA is the recorded
list of candidates.  Replacement of candidate is recorded in list DELTA.

?


> +      if (data->consider_all_candidates)
> +	{
> +	  for (j = 0; j < n_iv_cands (data); j++)
> +	    {
> +	      if (j == old_cp->cand->id)
> +		continue;
> +
> +	      cnd = iv_cand (data, j);
> +	      cp = get_use_iv_cost (data, use, cnd);
> +	      if (!cp)
> +		continue;
> +
> +	      if (best_cp == NULL || cheaper_cost_pair (cp, best_cp))
> +		best_cp = cp;
> +	    }
> +	}
> +      else
> +	{
> +	  EXECUTE_IF_SET_IN_BITMAP (use->related_cands, 0, j, bi)
> +	    {
> +	      if (j == old_cp->cand->id)
> +		continue;
> +
> +	      cnd = iv_cand (data, j);
> +	      cp = get_use_iv_cost (data, use, cnd);
> +	      if (!cp)
> +		continue;
> +
> +	      if (best_cp == NULL || cheaper_cost_pair (cp, best_cp))
> +		best_cp = cp;
> +	    }
> +	}
The loop bodies here are duplicated.  Can you factor them into a 
function so that this looks something like

if (data->consider_all_candidates)
   {
     for (...)
       refactored_code (some arguments)
   }
else
   {
      EXECUTE_IF_SET_IN_BITMAP (...)
        refactored_code (some arguments)

> @@ -6042,8 +6121,50 @@ try_improve_iv_set (struct ivopts_data *data, stru
>         /* Try removing the candidates from the set instead.  */
>         best_cost = iv_ca_prune (data, ivs, NULL, &best_delta);
>
> -      /* Nothing more we can do.  */
>         if (!best_delta)
> +	{
> +	  /* So far candidate selecting algorithm tends to choose fewer IVs
> +	     so that it can handle cases in which loops have many variables
> +	     but the best choice is often to use only one general biv.  One
> +	     weakness is it can't handle opposite cases, in which different
> +	     candidates should be chosen with respect to each use.  To solve
> +	     the problem, we replace candidate of some uses with lower cost
> +	     one, thus general algorithm can have a chance to find optimal
> +	     set for these cases.  */
So, in essence we've computed a best cost with "minimal" IVs and you're 
using that result as an initial state for expanding the IV set.  You 
expand on a candidate by candidate basis if and only if the estimated 
cost lowers.  Right?

At each expansion step you keep a single candidate fixed and you try to 
derive other IVs from that fixed IV if doing so lowers the cost?  Right?


That's what it looks like to me when reading the code/comments.  Just 
want to make sure it's working the way I think it is.

FWIW, if I read the slides correctly, GCC's IV code was called out as 
being one of the reasons GCC is generating better code for AArch64 than 
LLVM at a recent LLVM conference.  Glad to see that we're called out in 
that way and that we continue to try to improve in that space as well.

Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-09 22:58 ` Jeff Law
@ 2014-12-10  6:10   ` Bin.Cheng
  0 siblings, 0 replies; 14+ messages in thread
From: Bin.Cheng @ 2014-12-10  6:10 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bin Cheng, gcc-patches List, Zdenek Dvorak

On Wed, Dec 10, 2014 at 6:58 AM, Jeff Law <law@redhat.com> wrote:
> On 12/05/14 05:15, Bin Cheng wrote:
>>
>> Hi,
>> Though PR62178 is hidden by recent cost change in aarch64 backend, the
>> ivopt
>> issue still exists.
>>
>> Current candidate selecting algorithm tends to select fewer candidates
>> given
>> below reasons:
>>    1) to better handle loops with many induction uses but the best choice
>> is
>> one generic basic induction variable;
>>    2) to keep compilation time low.
>>
>> One fundamental weakness of the strategy is the opposite situation can't
>> be
>> handled properly sometimes.  For these cases the best choice is each
>> induction variable has its own candidate.
>> This patch fixes the problem by shuffling candidate set after fix-point is
>> reached by current implementation.  The reason why this strategy works is
>> it
>> replaces candidate set by selecting local optimal candidate for some
>> induction uses, and the new candidate set (has lower cost) is exact what
>> we
>> want in the mentioned case.  Instrumentation data shows this can find
>> better
>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>
>> This patch actually is extension to the first version patch posted at
>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>> another selecting pass with special seed set (more or less like the
>> shuffled
>> set in this patch).  Data also confirms this patch can find optimal sets
>> for
>> most loops found by the first one, as well as optimal sets for many new
>> loops.
>>
>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>> test on aarch64.
>> Since this patch only selects candidate set with lower cost, any
>> regressions
>> revealed are latent bugs of other components in GCC.
>> I also collected GCC bootstrap time on x86_64, no regression either.
>> Is this OK?
>>
>> 2014-12-03  Bin Chengbin.cheng@arm.com
>>
>>    PR tree-optimization/62178
>>    * tree-ssa-loop-ivopts.c (iv_ca_replace): New function.
>>    (try_improve_iv_set): Shuffle candidates set in order to handle
>>    case in which candidate wrto each iv use should be selected.
>>
>> gcc/testsuite/ChangeLog
>> 2014-12-03  Bin Chengbin.cheng@arm.com
>>
>>    PR tree-optimization/62178
>>    * gcc.target/aarch64/pr62178.c: New test.
>>
>>
>> pr62178-20141202.txt
>>
>>
>> Index: gcc/tree-ssa-loop-ivopts.c
>> ===================================================================
>> --- gcc/tree-ssa-loop-ivopts.c  (revision 217828)
>> +++ gcc/tree-ssa-loop-ivopts.c  (working copy)
>> @@ -5718,6 +5718,85 @@ iv_ca_extend (struct ivopts_data *data, struct iv_
>>     return cost;
>>   }
>>
>> +/* Try replacing candidates in IVS which are recorded by list ACT_DELTA
>> to
>> +   lower cost candidates.  CAND is the one won't be replaced.
>> Replacement
>> +   of candidate is recorded in list DELTA.  */

Thanks for the review.

>
> Is this better written as:
>
> Try replacing candidates in IVS with IVs related to CAND (which is not
> changed) if doing so lowers the IV cost.  ACT_DELTA is the recorded
> list of candidates.  Replacement of candidate is recorded in list DELTA.
>
> ?

Will refine that.

>
>
>> +      if (data->consider_all_candidates)
>> +       {
>> +         for (j = 0; j < n_iv_cands (data); j++)
>> +           {
>> +             if (j == old_cp->cand->id)
>> +               continue;
>> +
>> +             cnd = iv_cand (data, j);
>> +             cp = get_use_iv_cost (data, use, cnd);
>> +             if (!cp)
>> +               continue;
>> +
>> +             if (best_cp == NULL || cheaper_cost_pair (cp, best_cp))
>> +               best_cp = cp;
>> +           }
>> +       }
>> +      else
>> +       {
>> +         EXECUTE_IF_SET_IN_BITMAP (use->related_cands, 0, j, bi)
>> +           {
>> +             if (j == old_cp->cand->id)
>> +               continue;
>> +
>> +             cnd = iv_cand (data, j);
>> +             cp = get_use_iv_cost (data, use, cnd);
>> +             if (!cp)
>> +               continue;
>> +
>> +             if (best_cp == NULL || cheaper_cost_pair (cp, best_cp))
>> +               best_cp = cp;
>> +           }
>> +       }
>
> The loop bodies here are duplicated.  Can you factor them into a function so
> that this looks something like
>
> if (data->consider_all_candidates)
>   {
>     for (...)
>       refactored_code (some arguments)
>   }
> else
>   {
>      EXECUTE_IF_SET_IN_BITMAP (...)
>        refactored_code (some arguments)
>

Will do that.

>> @@ -6042,8 +6121,50 @@ try_improve_iv_set (struct ivopts_data *data, stru
>>         /* Try removing the candidates from the set instead.  */
>>         best_cost = iv_ca_prune (data, ivs, NULL, &best_delta);
>>
>> -      /* Nothing more we can do.  */
>>         if (!best_delta)
>> +       {
>> +         /* So far candidate selecting algorithm tends to choose fewer
>> IVs
>> +            so that it can handle cases in which loops have many
>> variables
>> +            but the best choice is often to use only one general biv.
>> One
>> +            weakness is it can't handle opposite cases, in which
>> different
>> +            candidates should be chosen with respect to each use.  To
>> solve
>> +            the problem, we replace candidate of some uses with lower
>> cost
>> +            one, thus general algorithm can have a chance to find optimal
>> +            set for these cases.  */
>
> So, in essence we've computed a best cost with "minimal" IVs and you're
> using that result as an initial state for expanding the IV set.  You expand
> on a candidate by candidate basis if and only if the estimated cost lowers.
> Right?
Yes.

>
> At each expansion step you keep a single candidate fixed and you try to
> derive other IVs from that fixed IV if doing so lowers the cost?  Right?
More or less, it like below process:
Candidates set IVS (fixed point by current algorithm) --> Try extend
IVS with a new one CAND if it has lower local cost for any uses(There
will be a candidate set IVS', candidates in this subset of IVS are
replaced with CAND for some uses) --> Replace rest candidates in IVS
if they are in the set of IVS' with any other candidates having lower
local cost)  --> Prune the new set.

The first/last steps are actually from current implementation, I added
the middle step.  Generally yes, it expands fixed point got by current
algorithm to see if there is lower cost choice.  Maybe the comment
wasn't clear enough, I will see how to refine it.

>
>
> That's what it looks like to me when reading the code/comments.  Just want
> to make sure it's working the way I think it is.
>
> FWIW, if I read the slides correctly, GCC's IV code was called out as being
> one of the reasons GCC is generating better code for AArch64 than LLVM at a
> recent LLVM conference.  Glad to see that we're called out in that way and
> that we continue to try to improve in that space as well.
>
> Jeff

I don't know what to say...  IVOPT doesn't work very well on aarch64
and I have a not short todo list for it.  Of course some of them are
target indenpendent.

Thanks,
bin

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-05 12:15 [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try Bin Cheng
  2014-12-09 22:58 ` Jeff Law
@ 2014-12-10 13:47 ` Richard Biener
  2014-12-11  9:56   ` Bin.Cheng
  1 sibling, 1 reply; 14+ messages in thread
From: Richard Biener @ 2014-12-10 13:47 UTC (permalink / raw)
  To: Bin Cheng; +Cc: GCC Patches, Zdenek Dvorak

On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
> Hi,
> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
> issue still exists.
>
> Current candidate selecting algorithm tends to select fewer candidates given
> below reasons:
>   1) to better handle loops with many induction uses but the best choice is
> one generic basic induction variable;
>   2) to keep compilation time low.
>
> One fundamental weakness of the strategy is the opposite situation can't be
> handled properly sometimes.  For these cases the best choice is each
> induction variable has its own candidate.
> This patch fixes the problem by shuffling candidate set after fix-point is
> reached by current implementation.  The reason why this strategy works is it
> replaces candidate set by selecting local optimal candidate for some
> induction uses, and the new candidate set (has lower cost) is exact what we
> want in the mentioned case.  Instrumentation data shows this can find better
> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>
> This patch actually is extension to the first version patch posted at
> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
> another selecting pass with special seed set (more or less like the shuffled
> set in this patch).  Data also confirms this patch can find optimal sets for
> most loops found by the first one, as well as optimal sets for many new
> loops.
>
> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
> test on aarch64.
> Since this patch only selects candidate set with lower cost, any regressions
> revealed are latent bugs of other components in GCC.
> I also collected GCC bootstrap time on x86_64, no regression either.
> Is this OK?

The algorithm seems to be quadratic in the number of IV candidates
(at least):

+         for (i = 0; i < n_iv_cands (data); i++)
+           {
...
+             iv_ca_replace (data, ivs, cand, act_delta, &tmp_delta);
...

and

+static void
+iv_ca_replace (struct ivopts_data *data, struct iv_ca *ivs,
+              struct iv_cand *cand, struct iv_ca_delta *act_delta,
+              struct iv_ca_delta **delta)
+{
...
+  for (i = 0; i < ivs->upto; i++)
+    {
...
+      if (data->consider_all_candidates)
+       {
+         for (j = 0; j < n_iv_cands (data); j++)
+           {

possibly cubic if ivs->upto is of similar value.

I wonder if it is possible to restrict this to the single IV with
the largest delta?  After all we are iterating try_improve_iv_set.
Alternatively move the handling out of iteration completey,
thus into the caller of try_improve_iv_set?

Note that compile-time issues always arise in auto-generated code,
not during GCC bootstrap.

Richard.


> 2014-12-03  Bin Cheng  bin.cheng@arm.com
>
>   PR tree-optimization/62178
>   * tree-ssa-loop-ivopts.c (iv_ca_replace): New function.
>   (try_improve_iv_set): Shuffle candidates set in order to handle
>   case in which candidate wrto each iv use should be selected.
>
> gcc/testsuite/ChangeLog
> 2014-12-03  Bin Cheng  bin.cheng@arm.com
>
>   PR tree-optimization/62178
>   * gcc.target/aarch64/pr62178.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-10 13:47 ` Richard Biener
@ 2014-12-11  9:56   ` Bin.Cheng
  2014-12-11  9:57     ` Bin.Cheng
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Bin.Cheng @ 2014-12-11  9:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bin Cheng, GCC Patches, Zdenek Dvorak

On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>> Hi,
>> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
>> issue still exists.
>>
>> Current candidate selecting algorithm tends to select fewer candidates given
>> below reasons:
>>   1) to better handle loops with many induction uses but the best choice is
>> one generic basic induction variable;
>>   2) to keep compilation time low.
>>
>> One fundamental weakness of the strategy is the opposite situation can't be
>> handled properly sometimes.  For these cases the best choice is each
>> induction variable has its own candidate.
>> This patch fixes the problem by shuffling candidate set after fix-point is
>> reached by current implementation.  The reason why this strategy works is it
>> replaces candidate set by selecting local optimal candidate for some
>> induction uses, and the new candidate set (has lower cost) is exact what we
>> want in the mentioned case.  Instrumentation data shows this can find better
>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>
>> This patch actually is extension to the first version patch posted at
>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>> another selecting pass with special seed set (more or less like the shuffled
>> set in this patch).  Data also confirms this patch can find optimal sets for
>> most loops found by the first one, as well as optimal sets for many new
>> loops.
>>
>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>> test on aarch64.
>> Since this patch only selects candidate set with lower cost, any regressions
>> revealed are latent bugs of other components in GCC.
>> I also collected GCC bootstrap time on x86_64, no regression either.
>> Is this OK?
>
> The algorithm seems to be quadratic in the number of IV candidates
> (at least):
Yes, I worried about that too, that's why I measured the bootstrap
time.  One way is restrict this procedure one time for each loop.  I
already tried that and it can capture +90% loops.  Is this sounds
reasonable?

BTW, do we have some compilation time benchmarks for GCC?

Thanks,
bin
>
> +         for (i = 0; i < n_iv_cands (data); i++)
> +           {
> ...
> +             iv_ca_replace (data, ivs, cand, act_delta, &tmp_delta);
> ...
>
> and
>
> +static void
> +iv_ca_replace (struct ivopts_data *data, struct iv_ca *ivs,
> +              struct iv_cand *cand, struct iv_ca_delta *act_delta,
> +              struct iv_ca_delta **delta)
> +{
> ...
> +  for (i = 0; i < ivs->upto; i++)
> +    {
> ...
> +      if (data->consider_all_candidates)
> +       {
> +         for (j = 0; j < n_iv_cands (data); j++)
> +           {
>
> possibly cubic if ivs->upto is of similar value.
>
> I wonder if it is possible to restrict this to the single IV with
> the largest delta?  After all we are iterating try_improve_iv_set.
> Alternatively move the handling out of iteration completey,
> thus into the caller of try_improve_iv_set?
>
> Note that compile-time issues always arise in auto-generated code,
> not during GCC bootstrap.
>
> Richard.
>
>
>> 2014-12-03  Bin Cheng  bin.cheng@arm.com
>>
>>   PR tree-optimization/62178
>>   * tree-ssa-loop-ivopts.c (iv_ca_replace): New function.
>>   (try_improve_iv_set): Shuffle candidates set in order to handle
>>   case in which candidate wrto each iv use should be selected.
>>
>> gcc/testsuite/ChangeLog
>> 2014-12-03  Bin Cheng  bin.cheng@arm.com
>>
>>   PR tree-optimization/62178
>>   * gcc.target/aarch64/pr62178.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-11  9:56   ` Bin.Cheng
@ 2014-12-11  9:57     ` Bin.Cheng
  2014-12-11 12:09     ` Richard Biener
  2014-12-15 20:59     ` Sebastian Pop
  2 siblings, 0 replies; 14+ messages in thread
From: Bin.Cheng @ 2014-12-11  9:57 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bin Cheng, GCC Patches, Zdenek Dvorak

On Thu, Dec 11, 2014 at 5:56 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>>> Hi,
>>> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
>>> issue still exists.
>>>
>>> Current candidate selecting algorithm tends to select fewer candidates given
>>> below reasons:
>>>   1) to better handle loops with many induction uses but the best choice is
>>> one generic basic induction variable;
>>>   2) to keep compilation time low.
>>>
>>> One fundamental weakness of the strategy is the opposite situation can't be
>>> handled properly sometimes.  For these cases the best choice is each
>>> induction variable has its own candidate.
>>> This patch fixes the problem by shuffling candidate set after fix-point is
>>> reached by current implementation.  The reason why this strategy works is it
>>> replaces candidate set by selecting local optimal candidate for some
>>> induction uses, and the new candidate set (has lower cost) is exact what we
>>> want in the mentioned case.  Instrumentation data shows this can find better
>>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>>
>>> This patch actually is extension to the first version patch posted at
>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>>> another selecting pass with special seed set (more or less like the shuffled
>>> set in this patch).  Data also confirms this patch can find optimal sets for
>>> most loops found by the first one, as well as optimal sets for many new
>>> loops.
>>>
>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>>> test on aarch64.
>>> Since this patch only selects candidate set with lower cost, any regressions
>>> revealed are latent bugs of other components in GCC.
>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>> Is this OK?
>>
>> The algorithm seems to be quadratic in the number of IV candidates
>> (at least):
> Yes, I worried about that too, that's why I measured the bootstrap
> time.  One way is restrict this procedure one time for each loop.  I
> already tried that and it can capture +90% loops.  Is this sounds
> reasonable?
By +90%, I mean 90% from the 6% improved loops, not the total loop number...
>
> BTW, do we have some compilation time benchmarks for GCC?
>
> Thanks,
> bin
>>
>> +         for (i = 0; i < n_iv_cands (data); i++)
>> +           {
>> ...
>> +             iv_ca_replace (data, ivs, cand, act_delta, &tmp_delta);
>> ...
>>
>> and
>>
>> +static void
>> +iv_ca_replace (struct ivopts_data *data, struct iv_ca *ivs,
>> +              struct iv_cand *cand, struct iv_ca_delta *act_delta,
>> +              struct iv_ca_delta **delta)
>> +{
>> ...
>> +  for (i = 0; i < ivs->upto; i++)
>> +    {
>> ...
>> +      if (data->consider_all_candidates)
>> +       {
>> +         for (j = 0; j < n_iv_cands (data); j++)
>> +           {
>>
>> possibly cubic if ivs->upto is of similar value.
>>
>> I wonder if it is possible to restrict this to the single IV with
>> the largest delta?  After all we are iterating try_improve_iv_set.
>> Alternatively move the handling out of iteration completey,
>> thus into the caller of try_improve_iv_set?
>>
>> Note that compile-time issues always arise in auto-generated code,
>> not during GCC bootstrap.
>>
>> Richard.
>>
>>
>>> 2014-12-03  Bin Cheng  bin.cheng@arm.com
>>>
>>>   PR tree-optimization/62178
>>>   * tree-ssa-loop-ivopts.c (iv_ca_replace): New function.
>>>   (try_improve_iv_set): Shuffle candidates set in order to handle
>>>   case in which candidate wrto each iv use should be selected.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2014-12-03  Bin Cheng  bin.cheng@arm.com
>>>
>>>   PR tree-optimization/62178
>>>   * gcc.target/aarch64/pr62178.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-11  9:56   ` Bin.Cheng
  2014-12-11  9:57     ` Bin.Cheng
@ 2014-12-11 12:09     ` Richard Biener
  2014-12-16  8:43       ` Bin.Cheng
  2014-12-15 20:59     ` Sebastian Pop
  2 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2014-12-11 12:09 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Bin Cheng, GCC Patches, Zdenek Dvorak

On Thu, Dec 11, 2014 at 10:56 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>>> Hi,
>>> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
>>> issue still exists.
>>>
>>> Current candidate selecting algorithm tends to select fewer candidates given
>>> below reasons:
>>>   1) to better handle loops with many induction uses but the best choice is
>>> one generic basic induction variable;
>>>   2) to keep compilation time low.
>>>
>>> One fundamental weakness of the strategy is the opposite situation can't be
>>> handled properly sometimes.  For these cases the best choice is each
>>> induction variable has its own candidate.
>>> This patch fixes the problem by shuffling candidate set after fix-point is
>>> reached by current implementation.  The reason why this strategy works is it
>>> replaces candidate set by selecting local optimal candidate for some
>>> induction uses, and the new candidate set (has lower cost) is exact what we
>>> want in the mentioned case.  Instrumentation data shows this can find better
>>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>>
>>> This patch actually is extension to the first version patch posted at
>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>>> another selecting pass with special seed set (more or less like the shuffled
>>> set in this patch).  Data also confirms this patch can find optimal sets for
>>> most loops found by the first one, as well as optimal sets for many new
>>> loops.
>>>
>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>>> test on aarch64.
>>> Since this patch only selects candidate set with lower cost, any regressions
>>> revealed are latent bugs of other components in GCC.
>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>> Is this OK?
>>
>> The algorithm seems to be quadratic in the number of IV candidates
>> (at least):
> Yes, I worried about that too, that's why I measured the bootstrap
> time.  One way is restrict this procedure one time for each loop.  I
> already tried that and it can capture +90% loops.  Is this sounds
> reasonable?

Yes.  That's my suggestion to handle it in the caller of try_improve_iv_set?

> BTW, do we have some compilation time benchmarks for GCC?

There are various testcases linked from PR47344, I don't remember
any particular one putting load on IVOPTs (but I do remember seeing
IVOPTs in the ~25% area in -ftime-report for some testcases).

Thanks,
Richard.

> Thanks,
> bin
>>
>> +         for (i = 0; i < n_iv_cands (data); i++)
>> +           {
>> ...
>> +             iv_ca_replace (data, ivs, cand, act_delta, &tmp_delta);
>> ...
>>
>> and
>>
>> +static void
>> +iv_ca_replace (struct ivopts_data *data, struct iv_ca *ivs,
>> +              struct iv_cand *cand, struct iv_ca_delta *act_delta,
>> +              struct iv_ca_delta **delta)
>> +{
>> ...
>> +  for (i = 0; i < ivs->upto; i++)
>> +    {
>> ...
>> +      if (data->consider_all_candidates)
>> +       {
>> +         for (j = 0; j < n_iv_cands (data); j++)
>> +           {
>>
>> possibly cubic if ivs->upto is of similar value.
>>
>> I wonder if it is possible to restrict this to the single IV with
>> the largest delta?  After all we are iterating try_improve_iv_set.
>> Alternatively move the handling out of iteration completey,
>> thus into the caller of try_improve_iv_set?
>>
>> Note that compile-time issues always arise in auto-generated code,
>> not during GCC bootstrap.
>>
>> Richard.
>>
>>
>>> 2014-12-03  Bin Cheng  bin.cheng@arm.com
>>>
>>>   PR tree-optimization/62178
>>>   * tree-ssa-loop-ivopts.c (iv_ca_replace): New function.
>>>   (try_improve_iv_set): Shuffle candidates set in order to handle
>>>   case in which candidate wrto each iv use should be selected.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2014-12-03  Bin Cheng  bin.cheng@arm.com
>>>
>>>   PR tree-optimization/62178
>>>   * gcc.target/aarch64/pr62178.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-11  9:56   ` Bin.Cheng
  2014-12-11  9:57     ` Bin.Cheng
  2014-12-11 12:09     ` Richard Biener
@ 2014-12-15 20:59     ` Sebastian Pop
  2 siblings, 0 replies; 14+ messages in thread
From: Sebastian Pop @ 2014-12-15 20:59 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Richard Biener, Bin Cheng, GCC Patches, Zdenek Dvorak

Bin.Cheng wrote:
> do we have some compilation time benchmarks for GCC?

I'm using the llvm test-suite to see compile time differences:

$ git clone http://llvm.org/git/test-suite.git /path/to/test-suite
$ /path/to/test-suite/configure --without-llvmsrc --without-llvmobj --with-externals=/path/to/spec
$ make -k TEST=simple TARGET_LLVMGCC=/path/to/gcc TARGET_CXX=/path/to/g++ TARGET_CC=/path/to/gcc TARGET_LLVMGXX=/path/to/g++ CC_UNDER_TEST_IS_GCC=1 TARGET_FLAGS=  USE_REFERENCE_OUTPUT=1 CC_UNDER_TEST_TARGET_IS_AARCH64=1 OPTFLAGS="-O3" LLC_OPTFLAGS="-O3" ENABLE_OPTIMIZED=1 ARCH=AArch64 ENABLE_HASHED_PROGRAM_OUTPUT=1 DISABLE_JIT=1 report report.simple.csv
$ head -1 report.simple.csv
Program,CC,CC_Time,CC_Real_Time,Exec,Exec_Time,Exec_Real_Time
$ awk -F, '{print $1, $3 }' < report.simple.csv

Here is how to get benchmark code size:
$ make -k TEST=codesize TARGET_LLVMGCC=/path/to/gcc TARGET_CXX=/path/to/g++ TARGET_CC=/path/to/gcc TARGET_LLVMGXX=/path/to/g++ TARGET_FLAGS= USE_REFERENCE_OUTPUT=1 CC_UNDER_TEST_TARGET_IS_AARCH64=1 CC_UNDER_TEST_IS_CLANG=1 OPTFLAGS="-O3" LLC_OPTFLAGS="-O3" ENABLE_OPTIMIZED=1 ARCH=AArch64 ENABLE_HASHED_PROGRAM_OUTPUT=1 DISABLE_JIT=1 2>/dev/null | grep ^size: > test.codesize.txt

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-11 12:09     ` Richard Biener
@ 2014-12-16  8:43       ` Bin.Cheng
  2014-12-16  8:52         ` Fwd: " Bin.Cheng
                           ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Bin.Cheng @ 2014-12-16  8:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bin Cheng, GCC Patches, Zdenek Dvorak

[-- Attachment #1: Type: text/plain, Size: 4161 bytes --]

On Thu, Dec 11, 2014 at 8:08 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Dec 11, 2014 at 10:56 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>>>> Hi,
>>>> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
>>>> issue still exists.
>>>>
>>>> Current candidate selecting algorithm tends to select fewer candidates given
>>>> below reasons:
>>>>   1) to better handle loops with many induction uses but the best choice is
>>>> one generic basic induction variable;
>>>>   2) to keep compilation time low.
>>>>
>>>> One fundamental weakness of the strategy is the opposite situation can't be
>>>> handled properly sometimes.  For these cases the best choice is each
>>>> induction variable has its own candidate.
>>>> This patch fixes the problem by shuffling candidate set after fix-point is
>>>> reached by current implementation.  The reason why this strategy works is it
>>>> replaces candidate set by selecting local optimal candidate for some
>>>> induction uses, and the new candidate set (has lower cost) is exact what we
>>>> want in the mentioned case.  Instrumentation data shows this can find better
>>>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>>>
>>>> This patch actually is extension to the first version patch posted at
>>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>>>> another selecting pass with special seed set (more or less like the shuffled
>>>> set in this patch).  Data also confirms this patch can find optimal sets for
>>>> most loops found by the first one, as well as optimal sets for many new
>>>> loops.
>>>>
>>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>>>> test on aarch64.
>>>> Since this patch only selects candidate set with lower cost, any regressions
>>>> revealed are latent bugs of other components in GCC.
>>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>>> Is this OK?
>>>
>>> The algorithm seems to be quadratic in the number of IV candidates
>>> (at least):
>> Yes, I worried about that too, that's why I measured the bootstrap
>> time.  One way is restrict this procedure one time for each loop.  I
>> already tried that and it can capture +90% loops.  Is this sounds
>> reasonable?
>
> Yes.  That's my suggestion to handle it in the caller of try_improve_iv_set?
>
>> BTW, do we have some compilation time benchmarks for GCC?
>
> There are various testcases linked from PR47344, I don't remember
> any particular one putting load on IVOPTs (but I do remember seeing
> IVOPTs in the ~25% area in -ftime-report for some testcases).


Hi Jeff & Richard,
I updated patch according to your review comments.  Is this version looks good?
I didn't find cases in PR47344 which exercising IVOPT, but I produced
one case from PR53852 which runs ivopt for ~17% of total time (28s).
This patch does increase IVOPT time to 18%.  Unfortunately, I tried
the other restriction, it doesn't work as well as this one on spec2k6,
if I understood the method correctly.

Hi Sebastian,
Thanks for help!  I managed to run llvm compilation time tests
successfully as you suggested.  Case
Multisource/Benchmarks/mafft/pairlocalalign is regressed but I can't
reproduce it in cmd.  The running time of compilation of
pairlocalalign.c is too small comparing to the results.  I also tried
to invoke it by using RunSafely.sh but no lucky either.  So any
documentation on this?  Thanks very much!

Thanks,
bin

2014-12-16  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/62178
    * tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
    (iv_ca_replace): New function.
    (try_improve_iv_set): New parameter try_replace_p.
    Replace candidates in IVS by calling iv_ca_replace.
    (find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.

gcc/testsuite/ChangeLog
2014-12-16  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/62178
    * gcc.target/aarch64/pr62178.c: New test.

[-- Attachment #2: pr62178-20141216.txt --]
[-- Type: text/plain, Size: 7180 bytes --]

Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 218200)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -5862,6 +5862,127 @@ iv_ca_prune (struct ivopts_data *data, struct iv_c
   return best_cost;
 }
 
+/* Check if CAND_IDX is a candidate other than OLD_CAND and has
+   cheaper local cost for USE than BEST_CP.  Return pointer to
+   the corresponding cost_pair, otherwise just return BEST_CP.  */
+
+static struct cost_pair*
+cheaper_cost_with_cand (struct ivopts_data *data, struct iv_use *use,
+			unsigned int cand_idx, struct iv_cand *old_cand,
+			struct cost_pair *best_cp)
+{
+  struct iv_cand *cand;
+  struct cost_pair *cp;
+
+  gcc_assert (old_cand != NULL);
+  if (cand_idx == old_cand->id)
+    return best_cp;
+
+  cand = iv_cand (data, cand_idx);
+  cp = get_use_iv_cost (data, use, cand);
+  if (cp != NULL
+      && (best_cp == NULL || cheaper_cost_pair (cp, best_cp)))
+    return cp;
+
+  return best_cp;
+}
+
+/* Try replacing candidates in IVS which are used more than once.  In the
+   first step, this function replaces candidates in IVS with CAND if it
+   has lower local cost.  In the second step, it replaces candidate for
+   any iv use if the use is represented by a candidate has already been
+   replaced in the first step.  In the last step, this function tries to
+   prune the new IVS.  Replacement of candidate is recorded in DELTA.  */
+
+static comp_cost
+iv_ca_replace (struct ivopts_data *data, struct iv_ca *ivs,
+	       struct iv_cand *cand, struct iv_ca_delta **delta)
+{
+  comp_cost cost;
+  unsigned int i, j;
+  bitmap_iterator bi;
+  struct iv_use *use;
+  bool should_replace;
+  struct iv_ca_delta *act, *tmp_delta;
+  struct cost_pair *old_cp, *best_cp = NULL, *cp;
+
+  *delta = NULL;
+  /*  First step, extend current IVS.  */
+  for (i = 0; i < ivs->upto; i++)
+    {
+      use = iv_use (data, i);
+      old_cp = iv_ca_cand_for_use (ivs, use);
+
+      if (old_cp && old_cp->cand == cand)
+	continue;
+
+      /* Skip if the candidate is only used once in IVS.  */
+      if (ivs->n_cand_uses[old_cp->cand->id] == 1)
+	continue;
+
+      cp = get_use_iv_cost (data, use, cand);
+      if (!cp || !cheaper_cost_pair (cp, old_cp))
+	continue;
+
+      *delta = iv_ca_delta_add (use, old_cp, cp, *delta);
+    }
+
+  /* No need for further replacement.  */
+  if (*delta == NULL)
+    return infinite_cost;
+
+  iv_ca_delta_commit (data, ivs, *delta, true);
+  tmp_delta = NULL;
+  /* Second step, replace candidate for any iv use if the use is
+     represented by a candidate has already been replaced in the
+     first step.  */
+  for (i = 0; i < ivs->upto; i++)
+    {
+      use = iv_use (data, i);
+
+      old_cp = iv_ca_cand_for_use (ivs, use);
+      if (old_cp->cand == cand)
+	continue;
+
+      should_replace = false;
+      for (act = *delta; act; act = act->next_change)
+	if (old_cp->cand == act->old_cp->cand)
+	  {
+	    should_replace = true;
+	    break;
+	  }
+      if (!should_replace)
+	continue;
+
+      best_cp = NULL;
+      if (data->consider_all_candidates)
+	for (j = 0; j < n_iv_cands (data); j++)
+	  best_cp = cheaper_cost_with_cand (data, use, j,
+					    old_cp->cand, best_cp);
+      else
+	EXECUTE_IF_SET_IN_BITMAP (use->related_cands, 0, j, bi)
+	  best_cp = cheaper_cost_with_cand (data, use, j,
+					    old_cp->cand, best_cp);
+
+      if (!best_cp)
+	continue;
+
+      tmp_delta = iv_ca_delta_add (use, old_cp, best_cp, tmp_delta);
+    }
+
+  cost = iv_ca_cost (ivs);
+  /* Last step, prune IVS if necessary.  */
+  if (tmp_delta)
+    {
+      iv_ca_delta_commit (data, ivs, tmp_delta, true);
+      *delta = iv_ca_delta_join (*delta, tmp_delta);
+      cost = iv_ca_prune (data, ivs, cand, &tmp_delta);
+    }
+  iv_ca_delta_commit (data, ivs, *delta, false);
+  *delta = iv_ca_delta_join (*delta, tmp_delta);
+  return cost;
+}
+
 /* Tries to extend the sets IVS in the best possible way in order
    to express the USE.  If ORIGINALP is true, prefer candidates from
    the original set of IVs, otherwise favor important candidates not
@@ -5995,10 +6116,13 @@ get_initial_solution (struct ivopts_data *data, bo
   return ivs;
 }
 
-/* Tries to improve set of induction variables IVS.  */
+/* Tries to improve set of induction variables IVS.  TRY_REPLACE_P
+   points to a bool variable, this function tries to replace IVS at
+   fixed point if it's true.  */
 
 static bool
-try_improve_iv_set (struct ivopts_data *data, struct iv_ca *ivs)
+try_improve_iv_set (struct ivopts_data *data,
+		    struct iv_ca *ivs, bool *try_replace_p)
 {
   unsigned i, n_ivs;
   comp_cost acost, best_cost = iv_ca_cost (ivs);
@@ -6042,7 +6166,35 @@ static bool
       /* Try removing the candidates from the set instead.  */
       best_cost = iv_ca_prune (data, ivs, NULL, &best_delta);
 
-      /* Nothing more we can do.  */
+      if (!best_delta && *try_replace_p)
+	{
+	  *try_replace_p = false;
+	  /* So far candidate selecting algorithm tends to choose fewer IVs
+	     so that it can handle cases in which loops have many variables
+	     but the best choice is often to use only one general biv.  One
+	     weakness is it can't handle opposite cases, in which different
+	     candidates should be chosen with respect to each use.  To solve
+	     the problem, we replace candidates in a manner described by the
+	     comments of iv_ca_replace, thus give general algorithm a chance
+	     to break local optimal fixed-point in these cases.  */
+	  for (i = 0; i < n_iv_cands (data); i++)
+	    {
+	      cand = iv_cand (data, i);
+	      if (iv_ca_cand_used_p (ivs, cand))
+		continue;
+
+	      acost = iv_ca_replace (data, ivs, cand, &act_delta);
+	      if (compare_costs (acost, best_cost) < 0)
+		{
+		  best_cost = acost;
+		  iv_ca_delta_free (&best_delta);
+		  best_delta = act_delta;
+		}
+	      else
+		iv_ca_delta_free (&act_delta);
+	    }
+	}
+
       if (!best_delta)
 	return false;
     }
@@ -6061,6 +6213,7 @@ static struct iv_ca *
 find_optimal_iv_set_1 (struct ivopts_data *data, bool originalp)
 {
   struct iv_ca *set;
+  bool try_replace_p = true;
 
   /* Get the initial solution.  */
   set = get_initial_solution (data, originalp);
@@ -6077,7 +6230,7 @@ find_optimal_iv_set_1 (struct ivopts_data *data, b
       iv_ca_dump (data, dump_file, set);
     }
 
-  while (try_improve_iv_set (data, set))
+  while (try_improve_iv_set (data, set, &try_replace_p))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
Index: gcc/testsuite/gcc.target/aarch64/pr62178.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/pr62178.c	(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/pr62178.c	(revision 0)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a[30 +1][30 +1], b[30 +1][30 +1], r[30 +1][30 +1];
+
+void foo (void) {
+  int i, j, k;
+
+  for ( i = 1; i <= 30; i++ )
+    for ( j = 1; j <= 30; j++ ) {
+      r[i][j] = 0;
+      for(k = 1; k <= 30; k++ )
+        r[i][j] += a[i][k]*b[k][j];
+    }
+}
+
+/* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\."} } */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Fwd: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-16  8:43       ` Bin.Cheng
@ 2014-12-16  8:52         ` Bin.Cheng
  2014-12-16 18:18           ` Sebastian Pop
  2014-12-16 13:16         ` Bin.Cheng
  2014-12-17  9:54         ` Bin.Cheng
  2 siblings, 1 reply; 14+ messages in thread
From: Bin.Cheng @ 2014-12-16  8:52 UTC (permalink / raw)
  To: gcc-patches List; +Cc: Sebastian Pop

[-- Attachment #1: Type: text/plain, Size: 4533 bytes --]

CCing Sebastian.

Thanks,
bin

---------- Forwarded message ----------
From: Bin.Cheng <amker.cheng@gmail.com>
Date: Tue, Dec 16, 2014 at 4:42 PM
Subject: Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
To: Richard Biener <richard.guenther@gmail.com>
Cc: Bin Cheng <bin.cheng@arm.com>, GCC Patches
<gcc-patches@gcc.gnu.org>, Zdenek Dvorak <ook@ucw.cz>


On Thu, Dec 11, 2014 at 8:08 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Dec 11, 2014 at 10:56 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>>>> Hi,
>>>> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
>>>> issue still exists.
>>>>
>>>> Current candidate selecting algorithm tends to select fewer candidates given
>>>> below reasons:
>>>>   1) to better handle loops with many induction uses but the best choice is
>>>> one generic basic induction variable;
>>>>   2) to keep compilation time low.
>>>>
>>>> One fundamental weakness of the strategy is the opposite situation can't be
>>>> handled properly sometimes.  For these cases the best choice is each
>>>> induction variable has its own candidate.
>>>> This patch fixes the problem by shuffling candidate set after fix-point is
>>>> reached by current implementation.  The reason why this strategy works is it
>>>> replaces candidate set by selecting local optimal candidate for some
>>>> induction uses, and the new candidate set (has lower cost) is exact what we
>>>> want in the mentioned case.  Instrumentation data shows this can find better
>>>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>>>
>>>> This patch actually is extension to the first version patch posted at
>>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>>>> another selecting pass with special seed set (more or less like the shuffled
>>>> set in this patch).  Data also confirms this patch can find optimal sets for
>>>> most loops found by the first one, as well as optimal sets for many new
>>>> loops.
>>>>
>>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>>>> test on aarch64.
>>>> Since this patch only selects candidate set with lower cost, any regressions
>>>> revealed are latent bugs of other components in GCC.
>>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>>> Is this OK?
>>>
>>> The algorithm seems to be quadratic in the number of IV candidates
>>> (at least):
>> Yes, I worried about that too, that's why I measured the bootstrap
>> time.  One way is restrict this procedure one time for each loop.  I
>> already tried that and it can capture +90% loops.  Is this sounds
>> reasonable?
>
> Yes.  That's my suggestion to handle it in the caller of try_improve_iv_set?
>
>> BTW, do we have some compilation time benchmarks for GCC?
>
> There are various testcases linked from PR47344, I don't remember
> any particular one putting load on IVOPTs (but I do remember seeing
> IVOPTs in the ~25% area in -ftime-report for some testcases).


Hi Jeff & Richard,
I updated patch according to your review comments.  Is this version looks good?
I didn't find cases in PR47344 which exercising IVOPT, but I produced
one case from PR53852 which runs ivopt for ~17% of total time (28s).
This patch does increase IVOPT time to 18%.  Unfortunately, I tried
the other restriction, it doesn't work as well as this one on spec2k6,
if I understood the method correctly.

Hi Sebastian,
Thanks for help!  I managed to run llvm compilation time tests
successfully as you suggested.  Case
Multisource/Benchmarks/mafft/pairlocalalign is regressed but I can't
reproduce it in cmd.  The running time of compilation of
pairlocalalign.c is too small comparing to the results.  I also tried
to invoke it by using RunSafely.sh but no lucky either.  So any
documentation on this?  Thanks very much!

Thanks,
bin

2014-12-16  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/62178
    * tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
    (iv_ca_replace): New function.
    (try_improve_iv_set): New parameter try_replace_p.
    Replace candidates in IVS by calling iv_ca_replace.
    (find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.

gcc/testsuite/ChangeLog
2014-12-16  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/62178
    * gcc.target/aarch64/pr62178.c: New test.

[-- Attachment #2: pr62178-20141216.txt --]
[-- Type: text/plain, Size: 7180 bytes --]

Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 218200)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -5862,6 +5862,127 @@ iv_ca_prune (struct ivopts_data *data, struct iv_c
   return best_cost;
 }
 
+/* Check if CAND_IDX is a candidate other than OLD_CAND and has
+   cheaper local cost for USE than BEST_CP.  Return pointer to
+   the corresponding cost_pair, otherwise just return BEST_CP.  */
+
+static struct cost_pair*
+cheaper_cost_with_cand (struct ivopts_data *data, struct iv_use *use,
+			unsigned int cand_idx, struct iv_cand *old_cand,
+			struct cost_pair *best_cp)
+{
+  struct iv_cand *cand;
+  struct cost_pair *cp;
+
+  gcc_assert (old_cand != NULL);
+  if (cand_idx == old_cand->id)
+    return best_cp;
+
+  cand = iv_cand (data, cand_idx);
+  cp = get_use_iv_cost (data, use, cand);
+  if (cp != NULL
+      && (best_cp == NULL || cheaper_cost_pair (cp, best_cp)))
+    return cp;
+
+  return best_cp;
+}
+
+/* Try replacing candidates in IVS which are used more than once.  In the
+   first step, this function replaces candidates in IVS with CAND if it
+   has lower local cost.  In the second step, it replaces candidate for
+   any iv use if the use is represented by a candidate has already been
+   replaced in the first step.  In the last step, this function tries to
+   prune the new IVS.  Replacement of candidate is recorded in DELTA.  */
+
+static comp_cost
+iv_ca_replace (struct ivopts_data *data, struct iv_ca *ivs,
+	       struct iv_cand *cand, struct iv_ca_delta **delta)
+{
+  comp_cost cost;
+  unsigned int i, j;
+  bitmap_iterator bi;
+  struct iv_use *use;
+  bool should_replace;
+  struct iv_ca_delta *act, *tmp_delta;
+  struct cost_pair *old_cp, *best_cp = NULL, *cp;
+
+  *delta = NULL;
+  /*  First step, extend current IVS.  */
+  for (i = 0; i < ivs->upto; i++)
+    {
+      use = iv_use (data, i);
+      old_cp = iv_ca_cand_for_use (ivs, use);
+
+      if (old_cp && old_cp->cand == cand)
+	continue;
+
+      /* Skip if the candidate is only used once in IVS.  */
+      if (ivs->n_cand_uses[old_cp->cand->id] == 1)
+	continue;
+
+      cp = get_use_iv_cost (data, use, cand);
+      if (!cp || !cheaper_cost_pair (cp, old_cp))
+	continue;
+
+      *delta = iv_ca_delta_add (use, old_cp, cp, *delta);
+    }
+
+  /* No need for further replacement.  */
+  if (*delta == NULL)
+    return infinite_cost;
+
+  iv_ca_delta_commit (data, ivs, *delta, true);
+  tmp_delta = NULL;
+  /* Second step, replace candidate for any iv use if the use is
+     represented by a candidate has already been replaced in the
+     first step.  */
+  for (i = 0; i < ivs->upto; i++)
+    {
+      use = iv_use (data, i);
+
+      old_cp = iv_ca_cand_for_use (ivs, use);
+      if (old_cp->cand == cand)
+	continue;
+
+      should_replace = false;
+      for (act = *delta; act; act = act->next_change)
+	if (old_cp->cand == act->old_cp->cand)
+	  {
+	    should_replace = true;
+	    break;
+	  }
+      if (!should_replace)
+	continue;
+
+      best_cp = NULL;
+      if (data->consider_all_candidates)
+	for (j = 0; j < n_iv_cands (data); j++)
+	  best_cp = cheaper_cost_with_cand (data, use, j,
+					    old_cp->cand, best_cp);
+      else
+	EXECUTE_IF_SET_IN_BITMAP (use->related_cands, 0, j, bi)
+	  best_cp = cheaper_cost_with_cand (data, use, j,
+					    old_cp->cand, best_cp);
+
+      if (!best_cp)
+	continue;
+
+      tmp_delta = iv_ca_delta_add (use, old_cp, best_cp, tmp_delta);
+    }
+
+  cost = iv_ca_cost (ivs);
+  /* Last step, prune IVS if necessary.  */
+  if (tmp_delta)
+    {
+      iv_ca_delta_commit (data, ivs, tmp_delta, true);
+      *delta = iv_ca_delta_join (*delta, tmp_delta);
+      cost = iv_ca_prune (data, ivs, cand, &tmp_delta);
+    }
+  iv_ca_delta_commit (data, ivs, *delta, false);
+  *delta = iv_ca_delta_join (*delta, tmp_delta);
+  return cost;
+}
+
 /* Tries to extend the sets IVS in the best possible way in order
    to express the USE.  If ORIGINALP is true, prefer candidates from
    the original set of IVs, otherwise favor important candidates not
@@ -5995,10 +6116,13 @@ get_initial_solution (struct ivopts_data *data, bo
   return ivs;
 }
 
-/* Tries to improve set of induction variables IVS.  */
+/* Tries to improve set of induction variables IVS.  TRY_REPLACE_P
+   points to a bool variable, this function tries to replace IVS at
+   fixed point if it's true.  */
 
 static bool
-try_improve_iv_set (struct ivopts_data *data, struct iv_ca *ivs)
+try_improve_iv_set (struct ivopts_data *data,
+		    struct iv_ca *ivs, bool *try_replace_p)
 {
   unsigned i, n_ivs;
   comp_cost acost, best_cost = iv_ca_cost (ivs);
@@ -6042,7 +6166,35 @@ static bool
       /* Try removing the candidates from the set instead.  */
       best_cost = iv_ca_prune (data, ivs, NULL, &best_delta);
 
-      /* Nothing more we can do.  */
+      if (!best_delta && *try_replace_p)
+	{
+	  *try_replace_p = false;
+	  /* So far candidate selecting algorithm tends to choose fewer IVs
+	     so that it can handle cases in which loops have many variables
+	     but the best choice is often to use only one general biv.  One
+	     weakness is it can't handle opposite cases, in which different
+	     candidates should be chosen with respect to each use.  To solve
+	     the problem, we replace candidates in a manner described by the
+	     comments of iv_ca_replace, thus give general algorithm a chance
+	     to break local optimal fixed-point in these cases.  */
+	  for (i = 0; i < n_iv_cands (data); i++)
+	    {
+	      cand = iv_cand (data, i);
+	      if (iv_ca_cand_used_p (ivs, cand))
+		continue;
+
+	      acost = iv_ca_replace (data, ivs, cand, &act_delta);
+	      if (compare_costs (acost, best_cost) < 0)
+		{
+		  best_cost = acost;
+		  iv_ca_delta_free (&best_delta);
+		  best_delta = act_delta;
+		}
+	      else
+		iv_ca_delta_free (&act_delta);
+	    }
+	}
+
       if (!best_delta)
 	return false;
     }
@@ -6061,6 +6213,7 @@ static struct iv_ca *
 find_optimal_iv_set_1 (struct ivopts_data *data, bool originalp)
 {
   struct iv_ca *set;
+  bool try_replace_p = true;
 
   /* Get the initial solution.  */
   set = get_initial_solution (data, originalp);
@@ -6077,7 +6230,7 @@ find_optimal_iv_set_1 (struct ivopts_data *data, b
       iv_ca_dump (data, dump_file, set);
     }
 
-  while (try_improve_iv_set (data, set))
+  while (try_improve_iv_set (data, set, &try_replace_p))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
Index: gcc/testsuite/gcc.target/aarch64/pr62178.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/pr62178.c	(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/pr62178.c	(revision 0)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a[30 +1][30 +1], b[30 +1][30 +1], r[30 +1][30 +1];
+
+void foo (void) {
+  int i, j, k;
+
+  for ( i = 1; i <= 30; i++ )
+    for ( j = 1; j <= 30; j++ ) {
+      r[i][j] = 0;
+      for(k = 1; k <= 30; k++ )
+        r[i][j] += a[i][k]*b[k][j];
+    }
+}
+
+/* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\."} } */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-16  8:43       ` Bin.Cheng
  2014-12-16  8:52         ` Fwd: " Bin.Cheng
@ 2014-12-16 13:16         ` Bin.Cheng
  2014-12-17  9:54         ` Bin.Cheng
  2 siblings, 0 replies; 14+ messages in thread
From: Bin.Cheng @ 2014-12-16 13:16 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bin Cheng, GCC Patches, Zdenek Dvorak

Please ignore this one, I will further refine it.  Sorry for disturbing!

Thanks,
bin

On Tue, Dec 16, 2014 at 4:42 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Thu, Dec 11, 2014 at 8:08 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Thu, Dec 11, 2014 at 10:56 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>>>>> Hi,
>>>>> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
>>>>> issue still exists.
>>>>>
>>>>> Current candidate selecting algorithm tends to select fewer candidates given
>>>>> below reasons:
>>>>>   1) to better handle loops with many induction uses but the best choice is
>>>>> one generic basic induction variable;
>>>>>   2) to keep compilation time low.
>>>>>
>>>>> One fundamental weakness of the strategy is the opposite situation can't be
>>>>> handled properly sometimes.  For these cases the best choice is each
>>>>> induction variable has its own candidate.
>>>>> This patch fixes the problem by shuffling candidate set after fix-point is
>>>>> reached by current implementation.  The reason why this strategy works is it
>>>>> replaces candidate set by selecting local optimal candidate for some
>>>>> induction uses, and the new candidate set (has lower cost) is exact what we
>>>>> want in the mentioned case.  Instrumentation data shows this can find better
>>>>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>>>>
>>>>> This patch actually is extension to the first version patch posted at
>>>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>>>>> another selecting pass with special seed set (more or less like the shuffled
>>>>> set in this patch).  Data also confirms this patch can find optimal sets for
>>>>> most loops found by the first one, as well as optimal sets for many new
>>>>> loops.
>>>>>
>>>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>>>>> test on aarch64.
>>>>> Since this patch only selects candidate set with lower cost, any regressions
>>>>> revealed are latent bugs of other components in GCC.
>>>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>>>> Is this OK?
>>>>
>>>> The algorithm seems to be quadratic in the number of IV candidates
>>>> (at least):
>>> Yes, I worried about that too, that's why I measured the bootstrap
>>> time.  One way is restrict this procedure one time for each loop.  I
>>> already tried that and it can capture +90% loops.  Is this sounds
>>> reasonable?
>>
>> Yes.  That's my suggestion to handle it in the caller of try_improve_iv_set?
>>
>>> BTW, do we have some compilation time benchmarks for GCC?
>>
>> There are various testcases linked from PR47344, I don't remember
>> any particular one putting load on IVOPTs (but I do remember seeing
>> IVOPTs in the ~25% area in -ftime-report for some testcases).
>
>
> Hi Jeff & Richard,
> I updated patch according to your review comments.  Is this version looks good?
> I didn't find cases in PR47344 which exercising IVOPT, but I produced
> one case from PR53852 which runs ivopt for ~17% of total time (28s).
> This patch does increase IVOPT time to 18%.  Unfortunately, I tried
> the other restriction, it doesn't work as well as this one on spec2k6,
> if I understood the method correctly.
>
> Hi Sebastian,
> Thanks for help!  I managed to run llvm compilation time tests
> successfully as you suggested.  Case
> Multisource/Benchmarks/mafft/pairlocalalign is regressed but I can't
> reproduce it in cmd.  The running time of compilation of
> pairlocalalign.c is too small comparing to the results.  I also tried
> to invoke it by using RunSafely.sh but no lucky either.  So any
> documentation on this?  Thanks very much!
>
> Thanks,
> bin
>
> 2014-12-16  Bin Cheng  <bin.cheng@arm.com>
>
>     PR tree-optimization/62178
>     * tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
>     (iv_ca_replace): New function.
>     (try_improve_iv_set): New parameter try_replace_p.
>     Replace candidates in IVS by calling iv_ca_replace.
>     (find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.
>
> gcc/testsuite/ChangeLog
> 2014-12-16  Bin Cheng  <bin.cheng@arm.com>
>
>     PR tree-optimization/62178
>     * gcc.target/aarch64/pr62178.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Fwd: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-16  8:52         ` Fwd: " Bin.Cheng
@ 2014-12-16 18:18           ` Sebastian Pop
  0 siblings, 0 replies; 14+ messages in thread
From: Sebastian Pop @ 2014-12-16 18:18 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: gcc-patches List

Bin.Cheng wrote:
> Multisource/Benchmarks/mafft/pairlocalalign is regressed but I can't
> reproduce it in cmd.  The running time of compilation of
> pairlocalalign.c is too small comparing to the results.  I also tried
> to invoke it by using RunSafely.sh but no lucky either.  So any
> documentation on this?  Thanks very much!

There is not much documentation on running the llvm test-suite.
Here is how I do rerun a single benchmark:

In the build directory, if it is clean, i.e., you have just configure'd, you can
run "make clean" and that will traverse all the directories and create them if
they do not exist.  If you have already run "make TEST=simple" you do not have
to run "make clean" as you already have all the directories under the build dir.

Once you have the benchmark dir in the build dir, just do:
$ cd Multisource/Benchmarks/mafft/pairlocalalign
$ make clean
$ make TEST=simple [... all other variables as mentioned before ...]

this way you will only run that specific benchmark.

If you need to see which commands RunSafely.sh is running, I would suggest you
add some "echo $CMD" or "set -x" in there.
I think by default you do have the compiler commands.

Sebastian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-16  8:43       ` Bin.Cheng
  2014-12-16  8:52         ` Fwd: " Bin.Cheng
  2014-12-16 13:16         ` Bin.Cheng
@ 2014-12-17  9:54         ` Bin.Cheng
  2014-12-17 15:14           ` Richard Biener
  2 siblings, 1 reply; 14+ messages in thread
From: Bin.Cheng @ 2014-12-17  9:54 UTC (permalink / raw)
  To: GCC Patches; +Cc: Zdenek Dvorak, Jeff Law, Sebastian Pop, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 4375 bytes --]

On Tue, Dec 16, 2014 at 4:42 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Thu, Dec 11, 2014 at 8:08 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Thu, Dec 11, 2014 at 10:56 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>>>>> Hi,
>>>>> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
>>>>> issue still exists.
>>>>>
>>>>> Current candidate selecting algorithm tends to select fewer candidates given
>>>>> below reasons:
>>>>>   1) to better handle loops with many induction uses but the best choice is
>>>>> one generic basic induction variable;
>>>>>   2) to keep compilation time low.
>>>>>
>>>>> One fundamental weakness of the strategy is the opposite situation can't be
>>>>> handled properly sometimes.  For these cases the best choice is each
>>>>> induction variable has its own candidate.
>>>>> This patch fixes the problem by shuffling candidate set after fix-point is
>>>>> reached by current implementation.  The reason why this strategy works is it
>>>>> replaces candidate set by selecting local optimal candidate for some
>>>>> induction uses, and the new candidate set (has lower cost) is exact what we
>>>>> want in the mentioned case.  Instrumentation data shows this can find better
>>>>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>>>>
>>>>> This patch actually is extension to the first version patch posted at
>>>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>>>>> another selecting pass with special seed set (more or less like the shuffled
>>>>> set in this patch).  Data also confirms this patch can find optimal sets for
>>>>> most loops found by the first one, as well as optimal sets for many new
>>>>> loops.
>>>>>
>>>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>>>>> test on aarch64.
>>>>> Since this patch only selects candidate set with lower cost, any regressions
>>>>> revealed are latent bugs of other components in GCC.
>>>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>>>> Is this OK?
>>>>
>>>> The algorithm seems to be quadratic in the number of IV candidates
>>>> (at least):
>>> Yes, I worried about that too, that's why I measured the bootstrap
>>> time.  One way is restrict this procedure one time for each loop.  I
>>> already tried that and it can capture +90% loops.  Is this sounds
>>> reasonable?
>>
>> Yes.  That's my suggestion to handle it in the caller of try_improve_iv_set?
>>
>>> BTW, do we have some compilation time benchmarks for GCC?
>>
>> There are various testcases linked from PR47344, I don't remember
>> any particular one putting load on IVOPTs (but I do remember seeing
>> IVOPTs in the ~25% area in -ftime-report for some testcases).
>

Hi,
I further refined the patch.  Specifically, I factored out common
code, improved comments, and restricted new code in several ways, for
example, now iv_ca_replace runs exactly one time for each
find_optimal_iv_set; iv_ca_replace only tries to replace one candidate
in IVS each time and makes quick return if lower cost set is found;
most importantly, iv_ca_replace now checks
ALWAYS_PRUNE_CAND_SET_BOUND.
The patch is simplified with these changes.  As for compilation time,
IVOPT isn't regressed obviously for the overloaded case I created,
also regression in llvm compilation time benchmarks is gone.

I think we could adapt data structure in IVOPT to make it faster, for
example, record information in candidate about which uses are
represented by each cand, sort candidates by cost for each iv use.  I
may do some refactor in next stage1.

Bootstrap on x86_64, test ongoing.  So OK if no regressions?

Thanks,
bin

2014-12-17  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/62178
    * tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
    (iv_ca_replace): New function.
    (try_improve_iv_set): New parameter try_replace_p.
    Break local optimal fixed-point by calling iv_ca_replace.
    (find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.

gcc/testsuite/ChangeLog
2014-12-17  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/62178
    * gcc.target/aarch64/pr62178.c: New test.

[-- Attachment #2: pr62178-20141217.txt --]
[-- Type: text/plain, Size: 6253 bytes --]

Index: gcc/testsuite/gcc.target/aarch64/pr62178.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/pr62178.c	(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/pr62178.c	(revision 0)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a[30 +1][30 +1], b[30 +1][30 +1], r[30 +1][30 +1];
+
+void foo (void) {
+  int i, j, k;
+
+  for ( i = 1; i <= 30; i++ )
+    for ( j = 1; j <= 30; j++ ) {
+      r[i][j] = 0;
+      for(k = 1; k <= 30; k++ )
+        r[i][j] += a[i][k]*b[k][j];
+    }
+}
+
+/* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\."} } */
Index: gcc/tree-ssa-loop-ivopts.c
===================================================================
--- gcc/tree-ssa-loop-ivopts.c	(revision 218200)
+++ gcc/tree-ssa-loop-ivopts.c	(working copy)
@@ -5862,6 +5862,108 @@ iv_ca_prune (struct ivopts_data *data, struct iv_c
   return best_cost;
 }
 
+/* Check if CAND_IDX is a candidate other than OLD_CAND and has
+   cheaper local cost for USE than BEST_CP.  Return pointer to
+   the corresponding cost_pair, otherwise just return BEST_CP.  */
+
+static struct cost_pair*
+cheaper_cost_with_cand (struct ivopts_data *data, struct iv_use *use,
+			unsigned int cand_idx, struct iv_cand *old_cand,
+			struct cost_pair *best_cp)
+{
+  struct iv_cand *cand;
+  struct cost_pair *cp;
+
+  gcc_assert (old_cand != NULL && best_cp != NULL);
+  if (cand_idx == old_cand->id)
+    return best_cp;
+
+  cand = iv_cand (data, cand_idx);
+  cp = get_use_iv_cost (data, use, cand);
+  if (cp != NULL && cheaper_cost_pair (cp, best_cp))
+    return cp;
+
+  return best_cp;
+}
+
+/* Try breaking local optimal fixed-point for IVS by replacing candidates
+   which are used by more than one iv uses.  For each of those candidates,
+   this function tries to represent iv uses under that candidate using
+   other ones with lower local cost, then tries to prune the new set.
+   If the new set has lower cost, It returns the new cost after recording
+   candidate replacement in list DELTA.  */
+
+static comp_cost
+iv_ca_replace (struct ivopts_data *data, struct iv_ca *ivs,
+	       struct iv_ca_delta **delta)
+{
+  bitmap_iterator bi, bj;
+  unsigned int i, j, k;
+  struct iv_use *use;
+  struct iv_cand *cand;
+  comp_cost orig_cost, acost;
+  struct iv_ca_delta *act_delta, *tmp_delta;
+  struct cost_pair *old_cp, *best_cp = NULL;
+
+  *delta = NULL;
+  orig_cost = iv_ca_cost (ivs);
+
+  EXECUTE_IF_SET_IN_BITMAP (ivs->cands, 0, i, bi)
+    {
+      if (ivs->n_cand_uses[i] == 1
+	  || ivs->n_cand_uses[i] > ALWAYS_PRUNE_CAND_SET_BOUND)
+	continue;
+
+      cand = iv_cand (data, i);
+  
+      act_delta = NULL;
+      /*  Represent uses under current candidate using other ones with
+	  lower local cost.  */
+      for (j = 0; j < ivs->upto; j++)
+	{
+	  use = iv_use (data, j);
+	  old_cp = iv_ca_cand_for_use (ivs, use);
+
+	  if (old_cp->cand != cand)
+	    continue;
+
+	  best_cp = old_cp;
+	  if (data->consider_all_candidates)
+	    for (k = 0; k < n_iv_cands (data); k++)
+	      best_cp = cheaper_cost_with_cand (data, use, k,
+						old_cp->cand, best_cp);
+	  else
+	    EXECUTE_IF_SET_IN_BITMAP (use->related_cands, 0, k, bj)
+	      best_cp = cheaper_cost_with_cand (data, use, k,
+						old_cp->cand, best_cp);
+
+	  if (best_cp == old_cp)
+	    continue;
+
+	  act_delta = iv_ca_delta_add (use, old_cp, best_cp, act_delta);
+	}
+      /* No need for further prune.  */
+      if (!act_delta)
+	continue;
+
+      /* Prune the new candidate set.  */
+      iv_ca_delta_commit (data, ivs, act_delta, true);
+      acost = iv_ca_prune (data, ivs, NULL, &tmp_delta);
+      iv_ca_delta_commit (data, ivs, act_delta, false);
+      act_delta = iv_ca_delta_join (act_delta, tmp_delta);
+
+      if (compare_costs (acost, orig_cost) < 0)
+	{
+	  *delta = act_delta;
+	  return acost;
+	}
+      else
+	iv_ca_delta_free (&act_delta);
+    }
+
+  return orig_cost;
+}
+
 /* Tries to extend the sets IVS in the best possible way in order
    to express the USE.  If ORIGINALP is true, prefer candidates from
    the original set of IVs, otherwise favor important candidates not
@@ -5995,10 +6097,13 @@ get_initial_solution (struct ivopts_data *data, bo
   return ivs;
 }
 
-/* Tries to improve set of induction variables IVS.  */
+/* Tries to improve set of induction variables IVS.  TRY_REPLACE_P
+   points to a bool variable, this function tries to break local
+   optimal fixed-point by replacing candidates in IVS if it's true.  */
 
 static bool
-try_improve_iv_set (struct ivopts_data *data, struct iv_ca *ivs)
+try_improve_iv_set (struct ivopts_data *data,
+		    struct iv_ca *ivs, bool *try_replace_p)
 {
   unsigned i, n_ivs;
   comp_cost acost, best_cost = iv_ca_cost (ivs);
@@ -6042,7 +6147,20 @@ static bool
       /* Try removing the candidates from the set instead.  */
       best_cost = iv_ca_prune (data, ivs, NULL, &best_delta);
 
-      /* Nothing more we can do.  */
+      if (!best_delta && *try_replace_p)
+	{
+	  *try_replace_p = false;
+	  /* So far candidate selecting algorithm tends to choose fewer IVs
+	     so that it can handle cases in which loops have many variables
+	     but the best choice is often to use only one general biv.  One
+	     weakness is it can't handle opposite cases, in which different
+	     candidates should be chosen with respect to each use.  To solve
+	     the problem, we replace candidates in a manner described by the
+	     comments of iv_ca_replace, thus give general algorithm a chance
+	     to break local optimal fixed-point in these cases.  */
+	  best_cost = iv_ca_replace (data, ivs, &best_delta);
+	}
+
       if (!best_delta)
 	return false;
     }
@@ -6061,6 +6179,7 @@ static struct iv_ca *
 find_optimal_iv_set_1 (struct ivopts_data *data, bool originalp)
 {
   struct iv_ca *set;
+  bool try_replace_p = true;
 
   /* Get the initial solution.  */
   set = get_initial_solution (data, originalp);
@@ -6077,7 +6196,7 @@ find_optimal_iv_set_1 (struct ivopts_data *data, b
       iv_ca_dump (data, dump_file, set);
     }
 
-  while (try_improve_iv_set (data, set))
+  while (try_improve_iv_set (data, set, &try_replace_p))
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
  2014-12-17  9:54         ` Bin.Cheng
@ 2014-12-17 15:14           ` Richard Biener
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Biener @ 2014-12-17 15:14 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: GCC Patches, Zdenek Dvorak, Jeff Law, Sebastian Pop

On Wed, Dec 17, 2014 at 10:47 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Dec 16, 2014 at 4:42 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Thu, Dec 11, 2014 at 8:08 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Thu, Dec 11, 2014 at 10:56 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>>>>>> Hi,
>>>>>> Though PR62178 is hidden by recent cost change in aarch64 backend, the ivopt
>>>>>> issue still exists.
>>>>>>
>>>>>> Current candidate selecting algorithm tends to select fewer candidates given
>>>>>> below reasons:
>>>>>>   1) to better handle loops with many induction uses but the best choice is
>>>>>> one generic basic induction variable;
>>>>>>   2) to keep compilation time low.
>>>>>>
>>>>>> One fundamental weakness of the strategy is the opposite situation can't be
>>>>>> handled properly sometimes.  For these cases the best choice is each
>>>>>> induction variable has its own candidate.
>>>>>> This patch fixes the problem by shuffling candidate set after fix-point is
>>>>>> reached by current implementation.  The reason why this strategy works is it
>>>>>> replaces candidate set by selecting local optimal candidate for some
>>>>>> induction uses, and the new candidate set (has lower cost) is exact what we
>>>>>> want in the mentioned case.  Instrumentation data shows this can find better
>>>>>> candidates set for ~6% loops in spec2006 on x86_64, and ~4% on aarch64.
>>>>>>
>>>>>> This patch actually is extension to the first version patch posted at
>>>>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, that only adds
>>>>>> another selecting pass with special seed set (more or less like the shuffled
>>>>>> set in this patch).  Data also confirms this patch can find optimal sets for
>>>>>> most loops found by the first one, as well as optimal sets for many new
>>>>>> loops.
>>>>>>
>>>>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap and
>>>>>> test on aarch64.
>>>>>> Since this patch only selects candidate set with lower cost, any regressions
>>>>>> revealed are latent bugs of other components in GCC.
>>>>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>>>>> Is this OK?
>>>>>
>>>>> The algorithm seems to be quadratic in the number of IV candidates
>>>>> (at least):
>>>> Yes, I worried about that too, that's why I measured the bootstrap
>>>> time.  One way is restrict this procedure one time for each loop.  I
>>>> already tried that and it can capture +90% loops.  Is this sounds
>>>> reasonable?
>>>
>>> Yes.  That's my suggestion to handle it in the caller of try_improve_iv_set?
>>>
>>>> BTW, do we have some compilation time benchmarks for GCC?
>>>
>>> There are various testcases linked from PR47344, I don't remember
>>> any particular one putting load on IVOPTs (but I do remember seeing
>>> IVOPTs in the ~25% area in -ftime-report for some testcases).
>>
>
> Hi,
> I further refined the patch.  Specifically, I factored out common
> code, improved comments, and restricted new code in several ways, for
> example, now iv_ca_replace runs exactly one time for each
> find_optimal_iv_set; iv_ca_replace only tries to replace one candidate
> in IVS each time and makes quick return if lower cost set is found;
> most importantly, iv_ca_replace now checks
> ALWAYS_PRUNE_CAND_SET_BOUND.
> The patch is simplified with these changes.  As for compilation time,
> IVOPT isn't regressed obviously for the overloaded case I created,
> also regression in llvm compilation time benchmarks is gone.
>
> I think we could adapt data structure in IVOPT to make it faster, for
> example, record information in candidate about which uses are
> represented by each cand, sort candidates by cost for each iv use.  I
> may do some refactor in next stage1.

Yes, I agree.  A similar thing is to use affine combinations throughout
them to avoid going into/out of that representation all the time (I think
I've suggested that elsewhere).

> Bootstrap on x86_64, test ongoing.  So OK if no regressions?

Ok.

Thanks,
Richard.

> Thanks,
> bin
>
> 2014-12-17  Bin Cheng  <bin.cheng@arm.com>
>
>     PR tree-optimization/62178
>     * tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
>     (iv_ca_replace): New function.
>     (try_improve_iv_set): New parameter try_replace_p.
>     Break local optimal fixed-point by calling iv_ca_replace.
>     (find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.
>
> gcc/testsuite/ChangeLog
> 2014-12-17  Bin Cheng  <bin.cheng@arm.com>
>
>     PR tree-optimization/62178
>     * gcc.target/aarch64/pr62178.c: New test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-12-17 14:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-05 12:15 [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try Bin Cheng
2014-12-09 22:58 ` Jeff Law
2014-12-10  6:10   ` Bin.Cheng
2014-12-10 13:47 ` Richard Biener
2014-12-11  9:56   ` Bin.Cheng
2014-12-11  9:57     ` Bin.Cheng
2014-12-11 12:09     ` Richard Biener
2014-12-16  8:43       ` Bin.Cheng
2014-12-16  8:52         ` Fwd: " Bin.Cheng
2014-12-16 18:18           ` Sebastian Pop
2014-12-16 13:16         ` Bin.Cheng
2014-12-17  9:54         ` Bin.Cheng
2014-12-17 15:14           ` Richard Biener
2014-12-15 20:59     ` Sebastian Pop

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).