public inbox for gcc-patches@gcc.gnu.org
* [Patch  PR 44576]:  Reduce the computation cost in compute_miss_rate for prefetching loop arrays
@ 2010-06-30  2:00 Fang, Changpeng
  2010-06-30 10:20 ` Richard Guenther
  0 siblings, 1 reply; 11+ messages in thread
From: Fang, Changpeng @ 2010-06-30  2:00 UTC (permalink / raw)
  To: Zdenek Dvorak
  Cc: Richard Guenther, Christian Borntraeger, gcc-patches, uweigand, sebpop

[-- Attachment #1: Type: text/plain, Size: 1501 bytes --]

Hi,

Attached is the patch that partially fixes bug 44576:  testsuite/gfortran.dg/zero_sized_1.f90 with huge compile
time on prefetching + peeling.

The cost of compute_miss_rate in tree-ssa-loop-prefetch.c is found to be very high, and it results in an exploding
compilation time when there are thousands of memory references in the loop.

compute_miss_rate computes the probability (miss_rate) that two memory references fall on different cache 
lines. We will insert prefetches for both references if the computed miss rate is greater than the experimentally 
determined ACCEPTABLE_MISS_RATE. 

In this patch, compute_miss_rate is renamed to is_miss_rate_acceptable and returns early as soon as the computed 
miss rate exceeds ACCEPTABLE_MISS_RATE, which significantly reduces the cost in such cases. Note that the 
worst-case cost is not affected by this patch. 
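
For illustration, here is a minimal standalone sketch of the restructured check. The plain C types,
the driver, and the constant values below are simplified assumptions for the sketch only; the actual
change (using HOST_WIDE_INT and the target's cache parameters) is in the attached patch.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical threshold value for this sketch (per mille).  */
#define ACCEPTABLE_MISS_RATE 50

/* Return true iff two references with common STEP and address difference
   DELTA stay within the acceptable miss rate.  The point of the patch:
   give up as soon as the running miss count exceeds the allowed maximum,
   instead of always scanning every alignment and iteration.  */
static bool
is_miss_rate_acceptable (unsigned cache_line_size, long step, long delta,
                         unsigned distinct_iters, int align_unit)
{
  if (delta >= (long) cache_line_size)
    return false;                       /* Always misses.  */

  int total_positions = (cache_line_size / align_unit) * distinct_iters;
  int max_allowed_miss_positions
    = (ACCEPTABLE_MISS_RATE * total_positions) / 1000;
  int miss_positions = 0;

  for (unsigned align = 0; align < cache_line_size; align += align_unit)
    for (unsigned iter = 0; iter < distinct_iters; iter++)
      {
        long address1 = align + step * iter;
        long address2 = address1 + delta;
        if (address1 / cache_line_size != address2 / cache_line_size
            && ++miss_positions > max_allowed_miss_positions)
          return false;                 /* Early exit: rate already too high.  */
      }
  return true;
}

int
main (void)
{
  /* Made-up parameters: 64-byte cache lines, 8-byte alignment unit.  */
  printf ("%d\n", is_miss_rate_acceptable (64, 96, 8, 2, 8));  /* prints 0 */
  return 0;
}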

This patch reduces the compile time of the test case from 1m20'' to 1m03'' on an amd-linux64 system.
Note that without -fprefetch-loop-arrays, the compile time on the same system is 0m30''. The extra 33'' is mostly
spent in "compute_all_dependences (datarefs, &dependences, vloops, true)".  I am not sure whether the cost of
dependence analysis for a loop should be that high (with several thousand memory references in the loop) or
whether there is a bug in the dependence computation. 

The patch passed bootstrap and regression tests.

Is this patch OK to commit?

Thanks,

Changpeng

[-- Attachment #2: 0005-Reduce-the-cost-in-miss-rate-computation.patch --]
[-- Type: text/x-patch; name="0005-Reduce-the-cost-in-miss-rate-computation.patch", Size: 4582 bytes --]

From 3dbd1a84428979b5e73b2532c6389d31377b929a Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@pathscale.(none)>
Date: Mon, 28 Jun 2010 17:20:10 -0700
Subject: [PATCH 5/5] Reduce the cost in miss rate computation

	* tree-ssa-loop-prefetch.c (compute_miss_rate): Rename to
	is_miss_rate_acceptable. Pull total_positions computation
	out of the loops.  Early return if miss_positions exceeds
	the acceptable threshold.
	* tree-ssa-loop-prefetch.c (prune_ref_by_group_reuse): Call
	is_miss_rate_acceptable after renaming of compute_miss_rate.
---
 gcc/tree-ssa-loop-prefetch.c |   41 +++++++++++++++++++++--------------------
 1 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 27e2b42..f2c95ac 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -640,27 +640,29 @@ ddown (HOST_WIDE_INT x, unsigned HOST_WIDE_INT by)
 /* Given a CACHE_LINE_SIZE and two inductive memory references
    with a common STEP greater than CACHE_LINE_SIZE and an address
    difference DELTA, compute the probability that they will fall
-   in different cache lines.  DISTINCT_ITERS is the number of
-   distinct iterations after which the pattern repeats itself.
+   in different cache lines. Return true if the computed miss rate
+   is not greater than the ACCEPTABLE_MISS_RATE.  DISTINCT_ITERS is the
+   number of distinct iterations after which the pattern repeats itself.
    ALIGN_UNIT is the unit of alignment in bytes.  */
 
-static int
-compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size,
+static bool
+is_miss_rate_acceptable (unsigned HOST_WIDE_INT cache_line_size,
 		   HOST_WIDE_INT step, HOST_WIDE_INT delta,
 		   unsigned HOST_WIDE_INT distinct_iters,
 		   int align_unit)
 {
   unsigned align, iter;
-  int total_positions, miss_positions, miss_rate;
+  int total_positions, miss_positions, max_allowed_miss_positions;
   int address1, address2, cache_line1, cache_line2;
 
   /* It always misses if delta is greater than or equal to the cache
-     line size.  */ 
-  if (delta >= cache_line_size)
-    return 1000;
+     line size.  */
+  if (delta >= (HOST_WIDE_INT) cache_line_size)
+    return false;
 
-  total_positions = 0;
   miss_positions = 0;
+  total_positions = (cache_line_size / align_unit) * distinct_iters;
+  max_allowed_miss_positions = (ACCEPTABLE_MISS_RATE * total_positions) / 1000;
 
   /* Iterate through all possible alignments of the first
      memory reference within its cache line.  */
@@ -673,12 +675,14 @@ compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size,
 	address2 = address1 + delta;
 	cache_line1 = address1 / cache_line_size;
 	cache_line2 = address2 / cache_line_size;
-	total_positions += 1;
 	if (cache_line1 != cache_line2)
-	  miss_positions += 1;
+	  {
+	    miss_positions += 1;
+            if (miss_positions > max_allowed_miss_positions)
+	      return false;
+          }
       }
-  miss_rate = 1000 * miss_positions / total_positions;
-  return miss_rate;
+  return true;
 }
 
 /* Prune the prefetch candidate REF using the reuse with BY.
@@ -694,7 +698,6 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
   HOST_WIDE_INT delta = delta_b - delta_r;
   HOST_WIDE_INT hit_from;
   unsigned HOST_WIDE_INT prefetch_before, prefetch_block;
-  int miss_rate;
   HOST_WIDE_INT reduced_step;
   unsigned HOST_WIDE_INT reduced_prefetch_block;
   tree ref_type;
@@ -793,9 +796,8 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
   delta %= step;
   ref_type = TREE_TYPE (ref->mem);
   align_unit = TYPE_ALIGN (ref_type) / 8;
-  miss_rate = compute_miss_rate(prefetch_block, step, delta,
-				reduced_prefetch_block, align_unit);
-  if (miss_rate <= ACCEPTABLE_MISS_RATE)
+  if (is_miss_rate_acceptable (prefetch_block, step, delta,
+			       reduced_prefetch_block, align_unit))
     {
       /* Do not reduce prefetch_before if we meet beyond cache size.  */
       if (prefetch_before > L2_CACHE_SIZE_BYTES / PREFETCH_BLOCK)
@@ -809,9 +811,8 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
   /* Try also the following iteration.  */
   prefetch_before++;
   delta = step - delta;
-  miss_rate = compute_miss_rate(prefetch_block, step, delta,
-				reduced_prefetch_block, align_unit);
-  if (miss_rate <= ACCEPTABLE_MISS_RATE)
+  if (is_miss_rate_acceptable (prefetch_block, step, delta,
+			       reduced_prefetch_block, align_unit))
     {
       if (prefetch_before < ref->prefetch_before)
 	ref->prefetch_before = prefetch_before;
-- 
1.6.3.3


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch PR 44576]: Reduce the computation cost in  compute_miss_rate for prefetching loop arrays
  2010-06-30  2:00 [Patch PR 44576]: Reduce the computation cost in compute_miss_rate for prefetching loop arrays Fang, Changpeng
@ 2010-06-30 10:20 ` Richard Guenther
  2010-06-30 18:33   ` Fang, Changpeng
  2010-07-02 16:42   ` Sebastian Pop
  0 siblings, 2 replies; 11+ messages in thread
From: Richard Guenther @ 2010-06-30 10:20 UTC (permalink / raw)
  To: Fang, Changpeng
  Cc: Zdenek Dvorak, Christian Borntraeger, gcc-patches, uweigand, sebpop

On Wed, Jun 30, 2010 at 2:06 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Hi,
>
> Attached is the patch that partially fixes bug 44576:  testsuite/gfortran.dg/zero_sized_1.f90 with huge compile
> time on prefetching + peeling.
>
> The cost of compute_miss_rate in tree-ssa-loop-prefetch.c is found to be very high, and it results in an exploding
> compilation time when there are thousands of memory references in the loop.
>
> compute_miss_rate is to compute the possibility (miss_rate) that two memory references fall on difference cache
> lines. We will insert prefetches for both references if the computed miss rate is greater than the experimentally
> determined ACCEPTABLE_MISS_RATE.
>
> In this patch, compute_miss_rate is renamed as is_miss_rate_acceptable, and does an early return if the computed
> miss rate is already greater than the ACCEPTABLE_MISS_RATE, and thus significantly reduces the cost for such
> case. Note that the worst case cost is not affected by this patch.
>
> This patch reduces the compile time of the test case from 1m20'' to 1m03'' on an amd-linux64 system.
> Note that without -fprefetching-loop-arrays, the compile time on the same system is 0m30''. The extra 33'' is mostly
> spent in "compute_all_dependences (datarefs, &dependences, vloops, true)".  I am not sure whether the cost in
>  dependence analysis for loop should be that high (with several thousands memory references in loop) or there
> is a bug in dependence computation.

Well, it seems that you compute the same information over-and-over
by doing

unsigned int
tree_ssa_prefetch_arrays (void)
{
...
  FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "Processing loop %d:\n", loop->num);

      unrolled |= loop_prefetch_arrays (loop);

static bool
loop_prefetch_arrays (struct loop *loop)
{
...
 determine_loop_nest_reuse (loop, refs, no_other_refs);

static void
determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
                           bool no_other_refs)
{
...
  if (loop->inner)
    return;
...
  /* Find the outermost loop of the loop nest of loop (we require that
     there are no sibling loops inside the nest).  */
  nest = loop;
  while (1)
    {
      aloop = loop_outer (nest);

      if (aloop == current_loops->tree_root
          || aloop->inner->next)
        break;

      nest = aloop;
    }
...
  compute_all_dependences (datarefs, &dependences, vloops, true);


In fact you should iterate _only_ over innermost loops like

  FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)

does that make a difference?

>
> The patch passed Bootstrapping and regression tests.
>
> Is this patch OK to commit?

Ok.

Thanks,
Richard.

> Thanks,
>
> Changpeng

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [Patch PR 44576]: Reduce the computation cost in  compute_miss_rate for prefetching loop arrays
  2010-06-30 10:20 ` Richard Guenther
@ 2010-06-30 18:33   ` Fang, Changpeng
  2010-06-30 18:35     ` Richard Guenther
  2010-07-02 16:42   ` Sebastian Pop
  1 sibling, 1 reply; 11+ messages in thread
From: Fang, Changpeng @ 2010-06-30 18:33 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Zdenek Dvorak, Christian Borntraeger, gcc-patches, uweigand, sebpop

> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)

>does that make a difference?

This doesn't help, because "compute_all_dependences" was called the same number of times
as before.

(BTW, should we limit prefetching to the innermost loops only?)

In this test case, there are 6 large loops, where each loop has 729 memory references.
It takes 4~5 seconds to run "compute_all_dependences" for one such loop.


> The patch passed Bootstrapping and regression tests.
>
> Is this patch OK to commit?

>Ok.

 Thank you.

Changpeng

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch PR 44576]: Reduce the computation cost in  compute_miss_rate for prefetching loop arrays
  2010-06-30 18:33   ` Fang, Changpeng
@ 2010-06-30 18:35     ` Richard Guenther
  2010-06-30 18:39       ` Zdenek Dvorak
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Guenther @ 2010-06-30 18:35 UTC (permalink / raw)
  To: Fang, Changpeng
  Cc: Zdenek Dvorak, Christian Borntraeger, gcc-patches, uweigand, sebpop

On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>
>>does that make a difference?
>
> This doesn't help, because "compute_all_dependences" was called the same number of time
> as before.
>
> (BTW, should we limit prefetching only to the innermost one?)
>
> In this test case, there are 6 large loops, where each loop has 729 memory reference.
> It takes 4~5 seconds to "compute_all_dependence" for one such loop.

It shouldn't take that long.  Can you gather a more detailed profile?

>
>> The patch passed Bootstrapping and regression tests.
>>
>> Is this patch OK to commit?
>
>>Ok.
>
>  Thank you.
>
> Changpeng
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch PR 44576]: Reduce the computation cost in compute_miss_rate for prefetching loop arrays
  2010-06-30 18:35     ` Richard Guenther
@ 2010-06-30 18:39       ` Zdenek Dvorak
  2010-06-30 19:10         ` Richard Guenther
  2010-06-30 23:07         ` Fang, Changpeng
  0 siblings, 2 replies; 11+ messages in thread
From: Zdenek Dvorak @ 2010-06-30 18:39 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Fang, Changpeng, Christian Borntraeger, gcc-patches, uweigand, sebpop

Hi,

> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
> >
> >>does that make a difference?
> >
> > This doesn't help, because "compute_all_dependences" was called the same number of time
> > as before.
> >
> > (BTW, should we limit prefetching only to the innermost one?)
> >
> > In this test case, there are 6 large loops, where each loop has 729 memory reference.
> > It takes 4~5 seconds to "compute_all_dependence" for one such loop.
> 
> It shouldn't take that long.  Can you gather a more detailed profile?

actually, compute_all_dependences is quadratic in the number of memory
references, and in more complicated cases, it can perform rather complex
computations, so 5 seconds on 729 references does seem like a realistic time.
Of course, we need to either speed up the analysis or add a cut-off to avoid
it on loops with too many memory references (or both),

Zdenek

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch PR 44576]: Reduce the computation cost in  compute_miss_rate for prefetching loop arrays
  2010-06-30 19:10         ` Richard Guenther
@ 2010-06-30 18:46           ` Sebastian Pop
  2010-06-30 18:54             ` Richard Guenther
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Pop @ 2010-06-30 18:46 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Zdenek Dvorak, Fang, Changpeng, Christian Borntraeger,
	gcc-patches, uweigand

On Wed, Jun 30, 2010 at 13:05, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
>> Hi,
>>
>>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>>> >
>>> >>does that make a difference?
>>> >
>>> > This doesn't help, because "compute_all_dependences" was called the same number of time
>>> > as before.
>>> >
>>> > (BTW, should we limit prefetching only to the innermost one?)
>>> >
>>> > In this test case, there are 6 large loops, where each loop has 729 memory reference.
>>> > It takes 4~5 seconds to "compute_all_dependence" for one such loop.
>>>
>>> It shouldn't take that long.  Can you gather a more detailed profile?
>>
>> actually, compute_all_dependences is quadratic in the number of memory
>> references, and in more complicated cases, it can perform rather complex
>> computations, so 5 seconds on 729 references does seem like a realistic time.
>> Of course, we need to either speed up the analysis or add an cut-off to avoid
>> it on loops with too many memory references (or both),
>
> Well, but at -O3 the vectorizer computes dependences as well, and it
> doesn't take that much of time,  So there must be something obvious
> going wrong.
>

The dependence analysis in the vectorizer is done only in the innermost
loop that is vectorized, whereas prefetch does the analysis of data deps
for every loop.

Sebastian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch PR 44576]: Reduce the computation cost in  compute_miss_rate for prefetching loop arrays
  2010-06-30 18:46           ` Sebastian Pop
@ 2010-06-30 18:54             ` Richard Guenther
  2010-06-30 19:40               ` Fang, Changpeng
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Guenther @ 2010-06-30 18:54 UTC (permalink / raw)
  To: Sebastian Pop
  Cc: Zdenek Dvorak, Fang, Changpeng, Christian Borntraeger,
	gcc-patches, uweigand

On Wed, Jun 30, 2010 at 8:10 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 13:05, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
>>> Hi,
>>>
>>>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>>>> >
>>>> >>does that make a difference?
>>>> >
>>>> > This doesn't help, because "compute_all_dependences" was called the same number of time
>>>> > as before.
>>>> >
>>>> > (BTW, should we limit prefetching only to the innermost one?)
>>>> >
>>>> > In this test case, there are 6 large loops, where each loop has 729 memory reference.
>>>> > It takes 4~5 seconds to "compute_all_dependence" for one such loop.
>>>>
>>>> It shouldn't take that long.  Can you gather a more detailed profile?
>>>
>>> actually, compute_all_dependences is quadratic in the number of memory
>>> references, and in more complicated cases, it can perform rather complex
>>> computations, so 5 seconds on 729 references does seem like a realistic time.
>>> Of course, we need to either speed up the analysis or add an cut-off to avoid
>>> it on loops with too many memory references (or both),
>>
>> Well, but at -O3 the vectorizer computes dependences as well, and it
>> doesn't take that much of time,  So there must be something obvious
>> going wrong.
>>
>
> The dependence analysis in the vectorizer is done only in the innermost
> loop that is vectorized, whereas prefetch does the analysis of data deps
> for every loop.

The vectorizer also does dependence analysis for outer-loop vectorization.
I guess prefetch analysis should restrict itself to data dependences within a
maximum loop nest depth (of, say, 2 or 3).

Still dependence analysis looks overly costly here.

Richard.

> Sebastian
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch PR 44576]: Reduce the computation cost in  compute_miss_rate for prefetching loop arrays
  2010-06-30 18:39       ` Zdenek Dvorak
@ 2010-06-30 19:10         ` Richard Guenther
  2010-06-30 18:46           ` Sebastian Pop
  2010-06-30 23:07         ` Fang, Changpeng
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Guenther @ 2010-06-30 19:10 UTC (permalink / raw)
  To: Zdenek Dvorak
  Cc: Fang, Changpeng, Christian Borntraeger, gcc-patches, uweigand, sebpop

On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
> Hi,
>
>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>> >
>> >>does that make a difference?
>> >
>> > This doesn't help, because "compute_all_dependences" was called the same number of time
>> > as before.
>> >
>> > (BTW, should we limit prefetching only to the innermost one?)
>> >
>> > In this test case, there are 6 large loops, where each loop has 729 memory reference.
>> > It takes 4~5 seconds to "compute_all_dependence" for one such loop.
>>
>> It shouldn't take that long.  Can you gather a more detailed profile?
>
> actually, compute_all_dependences is quadratic in the number of memory
> references, and in more complicated cases, it can perform rather complex
> computations, so 5 seconds on 729 references does seem like a realistic time.
> Of course, we need to either speed up the analysis or add an cut-off to avoid
> it on loops with too many memory references (or both),

Well, but at -O3 the vectorizer computes dependences as well, and it
doesn't take that much time.  So there must be something obviously
going wrong.

Richard.

> Zdenek
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [Patch PR 44576]: Reduce the computation cost in  compute_miss_rate for prefetching loop arrays
  2010-06-30 18:54             ` Richard Guenther
@ 2010-06-30 19:40               ` Fang, Changpeng
  0 siblings, 0 replies; 11+ messages in thread
From: Fang, Changpeng @ 2010-06-30 19:40 UTC (permalink / raw)
  To: Richard Guenther, Sebastian Pop
  Cc: Zdenek Dvorak, Christian Borntraeger, gcc-patches, uweigand


>>> Well, but at -O3 the vectorizer computes dependences as well, and it
>>> doesn't take that much of time,  So there must be something obvious
>>> going wrong.
>>>
>>
>> The dependence analysis in the vectorizer is done only in the innermost
>> loop that is vectorized, whereas prefetch does the analysis of data deps
>> for every loop.

>The vectorizer also does dependence analysis for outer-loop vectorization.
>I guess prefetch analysis should restrict itself to data dependences on a
>maximum loop nest depth (of say 2 or 3).

For the vectorizer (and tree-loop-linear), we are just lucky that the loop could
not be vectorized, so dependence analysis was not invoked at all for 
this loop. For this test case, prefetching is the only pass that invokes 
dependence analysis.

>Still dependence analysis looks overly costly here.

Yes, we need to understand and reduce the dependence analysis cost.

Thanks,

Changpeng
 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [Patch PR 44576]: Reduce the computation cost in compute_miss_rate for prefetching loop arrays
  2010-06-30 18:39       ` Zdenek Dvorak
  2010-06-30 19:10         ` Richard Guenther
@ 2010-06-30 23:07         ` Fang, Changpeng
  1 sibling, 0 replies; 11+ messages in thread
From: Fang, Changpeng @ 2010-06-30 23:07 UTC (permalink / raw)
  To: Zdenek Dvorak, Richard Guenther
  Cc: Christian Borntraeger, gcc-patches, uweigand, sebpop


>actually, compute_all_dependences is quadratic in the number of memory
>references, and in more complicated cases, it can perform rather complex
>computations, so 5 seconds on 729 references does seem like a realistic time.
>Of course, we need to either speed up the analysis or add an cut-off to avoid
>it on loops with too many memory references (or both),

Actually, these 4~5 seconds are evenly distributed, and I do think the time is due to the
large number of memory references.

Let's take a closer look at this:

determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
			   bool no_other_refs)
{
    ...
    compute_all_dependences (datarefs, &dependences, vloops, true);  /* 3+ seconds */

   for (i = 0; VEC_iterate (ddr_p, dependences, i, dep); i++) /* 1+ second */
    {
      ...
    }
}

The 4~5 seconds are spent in compute_all_dependences and in the loop after it that collects
the dependence info. That loop iterates 265356 times and takes 1+ second; the loop itself
is simple and reasonable. Now, let's look at compute_all_dependences:

compute_all_dependences (VEC (data_reference_p, heap) *datarefs,
			 VEC (ddr_p, heap) **dependence_relations,
			 VEC (loop_p, heap) *loop_nest,
			 bool compute_self_and_rr)
{
  ...
  for (i = 0; VEC_iterate (data_reference_p, datarefs, i, a); i++)
    for (j = i + 1; VEC_iterate (data_reference_p, datarefs, j, b); j++)
      if (!DR_IS_READ (a) || !DR_IS_READ (b) || compute_self_and_rr)
	{
	  ddr = initialize_data_dependence_relation (a, b, loop_nest); /* 1+ second */
	  VEC_safe_push (ddr_p, heap, *dependence_relations, ddr);
          if (loop_nest)
   	    compute_affine_dependence (ddr, VEC_index (loop_p, loop_nest, 0)); /* 1+ second */
	}
  ...
}

Each of the functions initialize_data_dependence_relation and compute_affine_dependence is
called 265356 times (= 729 * 728 / 2 pairs), and each accounts for 1+ second over those calls. Again,
initialize_data_dependence_relation is also simple.

As a result, I conclude that the large number of memory references is the reason for the slowdown.
So, it is better to add a cut-off (for prefetching at least).
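
For what it's worth, a trivial standalone sketch of the kind of cut-off being discussed. The
threshold and the helper name below are made up for illustration; a real implementation would
presumably use a --param checked before calling compute_all_dependences.

#include <stdbool.h>
#include <stdio.h>

/* Made-up threshold; a real implementation would use a --param.  */
#define PREFETCH_MAX_MEM_REFS_PER_LOOP 200

/* Hypothetical helper: decide whether running the O(n^2) dependence
   analysis is affordable for a loop with N_REFS memory references.
   With 729 references there are 729 * 728 / 2 = 265356 pairs, which is
   where the seconds in compute_all_dependences go.  */
static bool
dependence_analysis_affordable (unsigned n_refs)
{
  unsigned long pairs = (unsigned long) n_refs * (n_refs - 1) / 2;
  fprintf (stderr, "%u refs -> %lu dependence pairs\n", n_refs, pairs);
  return n_refs <= PREFETCH_MAX_MEM_REFS_PER_LOOP;
}

int
main (void)
{
  printf ("%d\n", dependence_analysis_affordable (729));  /* prints 0 */
  return 0;
}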

Thanks,

Changpeng





^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch PR 44576]: Reduce the computation cost in  compute_miss_rate for prefetching loop arrays
  2010-06-30 10:20 ` Richard Guenther
  2010-06-30 18:33   ` Fang, Changpeng
@ 2010-07-02 16:42   ` Sebastian Pop
  1 sibling, 0 replies; 11+ messages in thread
From: Sebastian Pop @ 2010-07-02 16:42 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Fang, Changpeng, Zdenek Dvorak, Christian Borntraeger,
	gcc-patches, uweigand

On Wed, Jun 30, 2010 at 03:52, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 2:06 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> Hi,
>>
>> Attached is the patch that partially fixes bug 44576:  testsuite/gfortran.dg/zero_sized_1.f90 with huge compile
>> time on prefetching + peeling.
>>
>> The cost of compute_miss_rate in tree-ssa-loop-prefetch.c is found to be very high, and it results in an exploding
>> compilation time when there are thousands of memory references in the loop.
>>
>> compute_miss_rate is to compute the possibility (miss_rate) that two memory references fall on difference cache
>> lines. We will insert prefetches for both references if the computed miss rate is greater than the experimentally
>> determined ACCEPTABLE_MISS_RATE.
>>
>> In this patch, compute_miss_rate is renamed as is_miss_rate_acceptable, and does an early return if the computed
>> miss rate is already greater than the ACCEPTABLE_MISS_RATE, and thus significantly reduces the cost for such
>> case. Note that the worst case cost is not affected by this patch.
>>
>> This patch reduces the compile time of the test case from 1m20'' to 1m03'' on an amd-linux64 system.
>> Note that without -fprefetching-loop-arrays, the compile time on the same system is 0m30''. The extra 33'' is mostly
>> spent in "compute_all_dependences (datarefs, &dependences, vloops, true)".  I am not sure whether the cost in
>>  dependence analysis for loop should be that high (with several thousands memory references in loop) or there
>> is a bug in dependence computation.
>
> Well, it seems that you compute the same information over-and-over
> by doing
>
> unsigned int
> tree_ssa_prefetch_arrays (void)
> {
> ...
>  FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
>    {
>      if (dump_file && (dump_flags & TDF_DETAILS))
>        fprintf (dump_file, "Processing loop %d:\n", loop->num);
>
>      unrolled |= loop_prefetch_arrays (loop);
>
> static bool
> loop_prefetch_arrays (struct loop *loop)
> {
> ...
>  determine_loop_nest_reuse (loop, refs, no_other_refs);
>
> static void
> determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
>                           bool no_other_refs)
> {
> ...
>  if (loop->inner)
>    return;
> ...
>  /* Find the outermost loop of the loop nest of loop (we require that
>     there are no sibling loops inside the nest).  */
>  nest = loop;
>  while (1)
>    {
>      aloop = loop_outer (nest);
>
>      if (aloop == current_loops->tree_root
>          || aloop->inner->next)
>        break;
>
>      nest = aloop;
>    }
> ...
>  compute_all_dependences (datarefs, &dependences, vloops, true);
>
>
> In fact you should iterate _only_ over innermost loops like
>
>  FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>
> does that make a difference?
>
>>
>> The patch passed Bootstrapping and regression tests.
>>
>> Is this patch OK to commit?
>
> Ok.
>

Committed r161728

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2010-07-02 16:42 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-30  2:00 [Patch PR 44576]: Reduce the computation cost in compute_miss_rate for prefetching loop arrays Fang, Changpeng
2010-06-30 10:20 ` Richard Guenther
2010-06-30 18:33   ` Fang, Changpeng
2010-06-30 18:35     ` Richard Guenther
2010-06-30 18:39       ` Zdenek Dvorak
2010-06-30 19:10         ` Richard Guenther
2010-06-30 18:46           ` Sebastian Pop
2010-06-30 18:54             ` Richard Guenther
2010-06-30 19:40               ` Fang, Changpeng
2010-06-30 23:07         ` Fang, Changpeng
2010-07-02 16:42   ` Sebastian Pop

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).