public inbox for gcc@gcc.gnu.org
* [RFC] A memory gathering optimization for loop
@ 2021-01-14  4:28 Feng Xue OS
  2021-01-17  1:23 ` JiangNing OS
  2021-04-30  6:30 ` Florian Weimer
  0 siblings, 2 replies; 9+ messages in thread
From: Feng Xue OS @ 2021-01-14  4:28 UTC (permalink / raw)
  To: gcc; +Cc: JiangNing OS, Hao Liu OS

1. Background

In a loop, it is optimal if only one memory stream is activated, that
is, if all memory operations sequentially access one data region. But
that is not always the case, for example when traversing a linked list
or manipulating discrete arrays. In such scenarios, the loop contains
multiple scattered memory accesses, which could trigger inefficient
multi-way hardware prefetching and thus cause the CPU cache hit rate to
drop. The tracker pr98598
(https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one
related problem that we intend to target.

2. Memory gathering optimization (MGO)

For nested loops, if scattered memory accesses inside the inner loop
remain unchanged across iterations of the outer loop, we can dynamically
gather the result data into a sequential memory region when the first
access in the inner loop takes place. This way, hardware prefetching
pressure can be reduced and the CPU cache hit rate improved. We name this
optimization MGO (memory gathering optimization). An example is given
below; it is slightly different from pr98598 and cannot be optimized by
loop distribution.

  struct A { int **p1; };

  int foo (int n, struct A *array)
  {
      int sum = 0;

      for (int i = 0; i < n; i++) {
          for (int j = 0; j < i; j++) {  /* iteration count is i */
              int **p1 = array[j].p1;
              int  *p2 = *p1;

              sum += *p2;
          }
      }

      return sum;
  }

Although this example may seem rather artificial, this kind of data
reuse in nested loops is often seen in real applications, especially
those written in Fortran or using the C++ STL.

To simplify the explanation of this memory gathering optimization,
pseudo code for the original nested loops is given as:

  outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
      ...
      inner-loop (iter, iter_count) {  /* inner loop to apply MGO */

          Type1 v1 = LOAD_FN1 (iter);
          Type2 v2 = LOAD_FN2 (v1);
          Type3 v3 = LOAD_FN3 (v2);
          ... 
          iter = NEXT_FN (iter);
      }

     ...
  }

"LOAD_FNx()" is a conceptual function that translates its argument to a
result, and is composed of a sequence of operations in which there is
only one memory load. It can represent a simple memory dereference like
"iter->field", or a more complicated expression like
"array[iter * 2 + cst].field".

Likewise, "NEXT_FN()" is a conceptual function that defines how an
iterator is advanced. It can describe a simple integer iterator, a list
iterator, or even a pure-function-based iterator over more advanced
structures such as a hashset or tree.
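
As a concrete (hypothetical) illustration, not part of the proposal, a
plain linked-list walk maps onto these conceptual functions as follows:

  /* Hypothetical example: each commented line performs exactly one
     memory load or one iterator advance.  */
  struct node { struct node *next; int *payload; };

  int walk (struct node *head)
  {
      int sum = 0;

      for (struct node *iter = head; iter; ) {
          int *v1 = iter->payload;       /* LOAD_FN1 (iter) */
          int  v2 = *v1;                 /* LOAD_FN2 (v1)   */

          sum += v2;
          iter = iter->next;             /* NEXT_FN (iter)  */
      }
      return sum;
  }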

Then the code will be transformed to:

  typedef struct cache_elem {
      bool   init;
      Type1  c_v1;
      Type2  c_v2;
      Type3  c_v3;
  } cache_elem;

  cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));

  outer-loop () {
      ...
      size_t cache_idx = 0;

      inner-loop (iter, iter_count) {

          if (!cache_arr[cache_idx].init) {
              v1 = LOAD_FN1 (iter);
              v2 = LOAD_FN2 (v1);
              v3 = LOAD_FN3 (v2);

              cache_arr[cache_idx].init = true;
              cache_arr[cache_idx].c_v1 = v1;
              cache_arr[cache_idx].c_v2 = v2;
              cache_arr[cache_idx].c_v3 = v3;
          }
          else {
              v1 = cache_arr[cache_idx].c_v1;
              v2 = cache_arr[cache_idx].c_v2;
              v3 = cache_arr[cache_idx].c_v3;
          }
          ...
          cache_idx++;
          iter = NEXT_FN (iter);
      }
      ...
  }

  free (cache_arr);
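
For concreteness, a hand-written sketch of what MGO might produce for the
earlier foo() example could look like the following (the name "foo_mgo" is
ours, error handling for a failed calloc is omitted, and the threshold and
alias checks described below are ignored):

  #include <stdbool.h>
  #include <stdlib.h>

  struct A { int **p1; };

  typedef struct cache_elem {
      bool   init;
      int  **c_v1;
      int   *c_v2;
      int    c_v3;
  } cache_elem;

  int foo_mgo (int n, struct A *array)
  {
      int sum = 0;
      cache_elem *cache_arr;

      if (n <= 1)
          return 0;                      /* inner loop body never executes */

      /* n - 1 is an upper bound of the inner-loop iteration count.  */
      cache_arr = calloc (n - 1, sizeof (cache_elem));

      for (int i = 0; i < n; i++) {
          size_t cache_idx = 0;

          for (int j = 0; j < i; j++) {
              int v3;

              if (!cache_arr[cache_idx].init) {
                  int **v1 = array[j].p1;        /* LOAD_FN1 */
                  int  *v2 = *v1;                /* LOAD_FN2 */

                  v3 = *v2;                      /* LOAD_FN3 */
                  cache_arr[cache_idx].init = true;
                  /* c_v1/c_v2 are cached only to mirror the generic
                     transformation above; only c_v3 is reused here.  */
                  cache_arr[cache_idx].c_v1 = v1;
                  cache_arr[cache_idx].c_v2 = v2;
                  cache_arr[cache_idx].c_v3 = v3;
              }
              else
                  v3 = cache_arr[cache_idx].c_v3;

              sum += v3;
              cache_idx++;
          }
      }

      free (cache_arr);
      return sum;
  }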

In this transformation, the cache space is compiler-generated temporary
storage, but it comes from the user heap. We find this trick to be very
uncommon in gcc, so we would like to know whether there are any special
considerations or restrictions on it. Using alloca() to obtain space from
the stack is an alternative, but then we must be careful not to incur a
stack overflow through non-user logic.

"iter_count" stands for an upper bound for inner-loop iteration count,
which must be an outer-loop-invariant expression, since it determines the
number of data elements in the cache space, and the space is allocated
before the outer loop.

In the example below, we can simply get such a bound for inner-loop 1
from its niter_desc. For inner-loop 2 it is not that easy, but it is
still possible to infer from comparison predicate information that the
loop's iteration count never exceeds "n".

  foo (int n, int m)
  {
      for (int i = 0; i < n; i++ ) {      /* outer-loop 1 */
          ...
          for (int j = 0; j < m; j++) {   /* inner-loop 1 */
              ...
          }
          ...
      }

      for (int i = 0; i < n; i ++) {     /* outer-loop 2 */
          ...
          for (int j = 0; j < i; j++) {  /* inner-loop 2 */
              ...
          }
          ...
      }
  }

3. Cache count threshold

We expect the cache space to behave like a normal stack temporary and not
noticeably impact program execution, especially when the heap is heavily
used in user code; at the very least, it must not exhaust memory and break
the program. For this, a maximum cache count threshold needs to be
introduced.

Since memory accesses are gathered dynamically at runtime, the
transformation has a cost, which lies in cache data filling, status checks
and space allocation. The runtime overhead of MGO could outweigh its
performance benefit if the cache data count is too small, so we also need
a minimum cache count threshold.

With a new runtime check against these two cache data count thresholds,
the MGO transformation becomes:

  trans_ok = iter_count >= min_cache_data_count &&
             iter_count <= max_cache_data_count;

  if (trans_ok) {
      cache_arr = calloc (iter_count, sizeof(cache_elem));
  }

  outer-loop () {
      ...
      if (trans_ok)  {
          /* transformed inner-loop */
      }
      else {
          /* original inner-loop */
      }
  }

  if (trans_ok) {
      free (cache_arr);
  }

4. Runtime alias check

In baseline MGO, memory disambiguation for a candidate load is based on
static alias analysis, which tends to give very conservative results.
A more aggressive option is a runtime alias check, which can easily be
integrated into the data-caching code. In the following pseudo code, a
store outside the inner loop is aliased with the candidate load; if we
apply the optimization based on static alias analysis alone, it prevents
the load and its dependent loads from being optimized.

  Type1 v0;

  outer-loop () {

      v0->field = value;  /* Aliased with v1->field */

      inner-loop (iter) {

          Type1 v1 = LOAD_FN1 (iter);
          Type2 v2 = v1->field;
          ...
      }
     ...
  }


When the runtime alias check is enabled, MGO generates code to
incrementally compute the address range touched by the candidate load,
and to check at runtime whether any may-alias store overlaps that range.
Once a conflict is detected, execution switches to the original inner
loop. With the runtime alias check added, the transformation becomes:

  Type1 v0;
  std::pair<void *, void *> v1_addr_range;

  outer-loop () {

      v0->field = value;
      ...
      /* Check whether v0 is covered by range.  */
      trans_ok &= !IN_RANGE (&v1_addr_range, v0);

      if (trans_ok) {

          inner-loop (iter) {

              if (!cache_arr[cache_idx].init) {
                 Type1 v1 = LOAD_FN1 (iter);
                 Type2 v2 = v1->field;

                 /* Extend range to cover v1. */
                 UPDATE_RANGE (&v1_addr_range, v1);
                 ...
              }
              ...
          }
      }
      else {
         /* original inner-loop */
      }
      ...
  }
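
IN_RANGE() and UPDATE_RANGE() above are placeholders for generated code.
One possible shape of these helpers, written here in plain C purely for
illustration (an actual implementation would also track the access size,
as the sketch does below), is:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Conservative [low, high) byte range covering all cached loads.
     A zero-initialized range ({ 0, 0 }) is treated as empty.  */
  typedef struct addr_range {
      uintptr_t low;    /* inclusive lower bound */
      uintptr_t high;   /* exclusive upper bound */
  } addr_range;

  /* Does a store to [addr, addr + size) overlap the recorded range?  */
  static bool in_range (const addr_range *r, const void *addr, size_t size)
  {
      uintptr_t p = (uintptr_t) addr;
      return r->low < r->high && p < r->high && p + size > r->low;
  }

  /* Extend the range so that it covers a load of [addr, addr + size).  */
  static void update_range (addr_range *r, const void *addr, size_t size)
  {
      uintptr_t p = (uintptr_t) addr;

      if (r->low == r->high) {      /* empty range: first cached load */
          r->low = p;
          r->high = p + size;
          return;
      }
      if (p < r->low)
          r->low = p;
      if (p + size > r->high)
          r->high = p + size;
  }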

Currently, we are working on an implementation of this proposal. It will
surely have limitations, so we hope for your comments on it. Thanks.

Best Regards,

Feng

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] A memory gathering optimization for loop
  2021-01-14  4:28 [RFC] A memory gathering optimization for loop Feng Xue OS
@ 2021-01-17  1:23 ` JiangNing OS
  2021-01-17 11:32   ` Richard Biener
  2021-04-30  6:30 ` Florian Weimer
  1 sibling, 1 reply; 9+ messages in thread
From: JiangNing OS @ 2021-01-17  1:23 UTC (permalink / raw)
  To: gcc

> -----Original Message-----
> From: Feng Xue OS <fxue@os.amperecomputing.com>
> Sent: Thursday, January 14, 2021 12:28 PM
> To: gcc@gcc.gnu.org
> Cc: JiangNing OS <jiangning@os.amperecomputing.com>; Hao Liu OS
> <hliu@os.amperecomputing.com>
> Subject: [RFC] A memory gathering optimization for loop
> 
> 1. Background
> 
> In a loop, it is optimal if only one memory stream is activated, that is, all
> memory operations sequentially access one data region. But that is always
> not the case, such as traversing link list and manipulating discrete arrays. In
> this scenario, the loop would contain multiple scattered memory accesses,
> which could trigger inefficient multi-way hardware prefetching, thus making
> cpu cache hit rate drop. The tracker
> pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) gives a
> description on one related problem that we are meant to target.
> 
> 2. Memory gathering optimization (MGO)
> 
> For nested loops, if scattered memory accesses inside inner loop remain
> unchanged in each iteration of outer loop, we can dynamically gather result
> data into a sequential memory region when the first access in the inner loop
> takes place. This way the hardware prefetching can be reduced, and cpu
> cache hit rate can be improved. We name this optimization MGO (memory
> gathering optimization). An example is given as below, which is a little
> different from pr98598, and could not be optimized by loop distribution.
> 
>   struct A { int **p1; };
> 
>   int foo (int n, struct A *array)
>   {
>       int sum = 0;
> 
>       for (int i = 0; i < n; i++) {
>           for (int j = 0; j < i; j++) {  /* iteration count is i */
>               int **p1 = array[j].p1;
>               int  *p2 = *p1;
> 
>               sum += *p2;
>           }
>       }
> 
>       return sum;
>   }
> 
> Although this example seems to be pretty artificial, the kind of data reuse in
> nested loops is often seen in real applications, especially written by Fortran,
> and C++ STL.
> 
> To simplify explanation of this memory gathering optimization, pseudo code
> on original nested loops is given as:
> 
>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
>       ...
>       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
> 
>           Type1 v1 = LOAD_FN1 (iter);
>           Type2 v2 = LOAD_FN2 (v1);
>           Type3 v3 = LOAD_FN3 (v2);
>           ...
>           iter = NEXT_FN (iter);
>       }
> 
>      ...
>   }
> 
> "LOAD_FNx()" is a conceptual function that translates its argument to a result,
> and is composed of a sequence of operations in which there is only one
> memory load. It is capable of representing simple memory dereference like
> "iter->field", or more complicated expression like "array[iter * 2 + cst].field".
> 
> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
> iterator is advanced. It is able to describe simple integer iterator, list iterator,
> or even pure-function-based iterator on advanced stuff like hashset/tree.
> 
> Then the code will be transformed to:
> 
>   typedef struct cache_elem {
>       bool   init;
>       Type1  c_v1;
>       Type2  c_v2;
>       Type3  c_v3;
>   } cache_elem;
> 
>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
> 
>   outer-loop () {
>       ...
>       size_t cache_idx = 0;
> 
>       inner-loop (iter, iter_count) {
> 
>           if (!cache_arr[cache_idx]->init) {
>               v1 = LOAD_FN1 (iter);
>               v2 = LOAD_FN2 (v1);
>               v3 = LOAD_FN3 (v2);
> 
>               cache_arr[cache_idx]->init = true;
>               cache_arr[cache_idx]->c_v1 = v1;
>               cache_arr[cache_idx]->c_v2 = v2;
>               cache_arr[cache_idx]->c_v3 = v3;
>           }
>           else {
>               v1 = cache_arr[cache_idx]->c_v1;
>               v2 = cache_arr[cache_idx]->c_v2;
>               v3 = cache_arr[cache_idx]->c_v3;
>           }
>           ...
>           cache_idx++;
>           iter = NEXT_FN (iter);
>       }
>       ...
>   }
> 
>   free (cache_arr);

The example in PR#98598 preloads the data before the outer loop, while the
transformation in this RFC lazily caches the data in the inner loop, which
avoids introducing unnecessary data read-write hazards.

> 
> In the transformation, cache space belongs to compiler-generated temporary
> storage, but it is from user heap. We find this trick is very uncommon in gcc.
> We'd like to make sure whether there are any special considerations or
> restrictions on this. Anyway, using alloca() to get space from stack is an
> alternative, but we should be cautious to avoid incurring stack overflow by
> non-user logic.
> 
> "iter_count" stands for an upper bound for inner-loop iteration count, which
> must be an outer-loop invariant expression, in that it determines data count
> in cache space, and the space is allocated before outer-loop.
> 
> In the example below, we can simply get such bound for inner-loop 1 from
> its niter_desc, while for inner-loop 2, it is not that easy, but it is possible to
> infer that the loop's iteration count never exceeds "n" via comparison
> predicate information.
> 
>   foo (int n, int m)
>   {
>       for (int i = 0; i < n; i++ ) {      /* outer-loop 1 */
>           ...
>           for (int j = 0; j < m; j++) {   /* inner-loop 1 */
>               ...
>           }
>           ...
>       }
> 
>       for (int i = 0; i < n; i ++) {     /* outer-loop 2 */
>           ...
>           for (int j = 0; j < i; j++) {  /* inner-loop 2 */
>               ...
>           }
>           ...
>       }
>   }
> 
> 3. Cache count threshold
> 
> We expect cache space works as normal stack temporary, and would not
> impact execution of program so much, especially when heap is heavily used
> in user code, and at least should not exhaust memory and make the program
> broken. For this, a maximum cache count threshold needs to be introduced.
> 
> Since memory accesses are dynamically gathered at runtime, it has cost,
> which lies in cache data filling, status check and space allocation.
> Runtime overhead of MGO could outweigh performance benefit if cache data
> count is too small, so we also need a minimum cache count threshold.
> 
> With a new runtime check against two cache data count thresholds, MGO
> transformation will be:
> 
>   trans_ok = iter_count >= min_cache_data_count &&
>              iter_count <= max_cache_data_count;
> 
>   if (trans_ok) {
>       cache_arr = calloc (iter_count, sizeof(cache_elem));
>   }
> 
>   outer-loop () {
>       ...
>       if (trans_ok)  {
>           /* transformed inner-loop */
>       }
>       else {
>           /* original inner-loop */
>       }
>   }
> 
>   if (trans_ok) {
>       free (cache_arr);
>   }
> 
> 4. Runtime alias check
> 
> In baseline MGO, memory disambiguation for candidate load is based on
> static alias analysis, which always provides very conservative result.
> A more aggressive option is to use runtime alias check, which could be easily
> integrated into data-caching code. In the following pseudo code, a store
> outside inner-loop is aliased with the candidate load, and it prevents the load
> and its dependent loads from being optimized if we only apply the
> optimization via static alias analysis.
> 
>   Type1 v0;
> 
>   outer-loop () {
> 
>       v0->field = value;  /* Aliased with v1->field */
> 
>       inner-loop (iter) {
> 
>           Type1 v1 = LOAD_FN1 (iter);
>           Type2 v2 = v1->field;
>           ...
>       }
>      ...
>   }
> 
> 
> When runtime alias check is enabled, MGO will generate code to
> incrementally compute address range for candidate load, and check address
> overlap for may-alias stores and the range at runtime. Once conflict is
> detected, execution will switch to original inner-loop. With runtime alias
> check added, transformation will be:
> 
>   Type1 v0;
>   std::pair<void *, void *> v1_addr_range;
> 
>   outer-loop () {
> 
>       v0->field = value;
>       ...
>       /* Check whether v0 is covered by range.  */
>       trans_ok &= !IN_RANGE (&v1_addr_range, v0);
> 
>       if (trans_ok) {
> 
>           inner-loop (iter) {
> 
>               if (!cache_arr[cache_idx]->init) {
>                  Type1 v1 = LOAD_FN1 (iter);
>                  Type2 v2 = v1->field;
> 
>                  /* Extend range to cover v1. */
>                  UPDATE_RANGE (&v1_addr_range, v1);
>                  ...
>               }
>               ...
>           }
>       }
>       else {
>          /* original inner-loop */
>       }
>       ...
>   }
> 
> Currently, we are working on an implementation for this proposal. For sure,
> it might be limited, so hope your comments on it. Thanks.
> 
> Best Regards,
> 
> Feng

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] A memory gathering optimization for loop
  2021-01-17  1:23 ` JiangNing OS
@ 2021-01-17 11:32   ` Richard Biener
  2021-01-18  3:15     ` Feng Xue OS
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2021-01-17 11:32 UTC (permalink / raw)
  To: JiangNing OS, JiangNing OS via Gcc, gcc

On January 17, 2021 2:23:55 AM GMT+01:00, JiangNing OS via Gcc <gcc@gcc.gnu.org> wrote:
>> -----Original Message-----
>> From: Feng Xue OS <fxue@os.amperecomputing.com>
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS <jiangning@os.amperecomputing.com>; Hao Liu OS
>> <hliu@os.amperecomputing.com>
>> Subject: [RFC] A memory gathering optimization for loop
>> 
>> 1. Background
>> 
>> In a loop, it is optimal if only one memory stream is activated, that
>is, all
>> memory operations sequentially access one data region. But that is
>always
>> not the case, such as traversing link list and manipulating discrete
>arrays. In
>> this scenario, the loop would contain multiple scattered memory
>accesses,
>> which could trigger inefficient multi-way hardware prefetching, thus
>making
>> cpu cache hit rate drop. The tracker
>> pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) gives a
>> description on one related problem that we are meant to target.
>> 
>> 2. Memory gathering optimization (MGO)
>> 
>> For nested loops, if scattered memory accesses inside inner loop
>remain
>> unchanged in each iteration of outer loop, we can dynamically gather
>result
>> data into a sequential memory region when the first access in the
>inner loop
>> takes place. This way the hardware prefetching can be reduced, and
>cpu
>> cache hit rate can be improved. We name this optimization MGO (memory
>> gathering optimization). An example is given as below, which is a
>little
>> different from pr98598, and could not be optimized by loop
>distribution.
>> 
>>   struct A { int **p1; };
>> 
>>   int foo (int n, struct A *array)
>>   {
>>       int sum = 0;
>> 
>>       for (int i = 0; i < n; i++) {
>>           for (int j = 0; j < i; j++) {  /* iteration count is i */
>>               int **p1 = array[j].p1;
>>               int  *p2 = *p1;
>> 
>>               sum += *p2;
>>           }
>>       }
>> 
>>       return sum;
>>   }
>> 

The number of memory streams in this loop is limited (practically 1); how would the transform be affected by issuing prefetches for array at a well placed spot?

That is, what's the hazard we avoid (besides the obvious data dependence
shortening in case the load from array is in cache)?

Richard. 

>> Although this example seems to be pretty artificial, the kind of data
>reuse in
>> nested loops is often seen in real applications, especially written
>by Fortran,
>> and C++ STL.
>> 
>> To simplify explanation of this memory gathering optimization, pseudo
>code
>> on original nested loops is given as:
>> 
>>   outer-loop ( ) { /* data in inner loop are also invariant in outer
>loop.  */
>>       ...
>>       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>> 
>>           Type1 v1 = LOAD_FN1 (iter);
>>           Type2 v2 = LOAD_FN2 (v1);
>>           Type3 v3 = LOAD_FN3 (v2);
>>           ...
>>           iter = NEXT_FN (iter);
>>       }
>> 
>>      ...
>>   }
>> 
>> "LOAD_FNx()" is a conceptual function that translates its argument to
>a result,
>> and is composed of a sequence of operations in which there is only
>one
>> memory load. It is capable of representing simple memory dereference
>like
>> "iter->field", or more complicated expression like "array[iter * 2 +
>cst].field".
>> 
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how
>an
>> iterator is advanced. It is able to describe simple integer iterator,
>list iterator,
>> or even pure-function-based iterator on advanced stuff like
>hashset/tree.
>> 
>> Then the code will be transformed to:
>> 
>>   typedef struct cache_elem {
>>       bool   init;
>>       Type1  c_v1;
>>       Type2  c_v2;
>>       Type3  c_v3;
>>   } cache_elem;
>> 
>>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>> 
>>   outer-loop () {
>>       ...
>>       size_t cache_idx = 0;
>> 
>>       inner-loop (iter, iter_count) {
>> 
>>           if (!cache_arr[cache_idx]->init) {
>>               v1 = LOAD_FN1 (iter);
>>               v2 = LOAD_FN2 (v1);
>>               v3 = LOAD_FN3 (v2);
>> 
>>               cache_arr[cache_idx]->init = true;
>>               cache_arr[cache_idx]->c_v1 = v1;
>>               cache_arr[cache_idx]->c_v2 = v2;
>>               cache_arr[cache_idx]->c_v3 = v3;
>>           }
>>           else {
>>               v1 = cache_arr[cache_idx]->c_v1;
>>               v2 = cache_arr[cache_idx]->c_v2;
>>               v3 = cache_arr[cache_idx]->c_v3;
>>           }
>>           ...
>>           cache_idx++;
>>           iter = NEXT_FN (iter);
>>       }
>>       ...
>>   }
>> 
>>   free (cache_arr);
>
>The example in PR#98598 preloads the data before the outer loop, while
>the
>transformation in this RFC lazily caches the data in the inner loop,
>which is to
>avoid introducing unnecessary data RW hazards.
>
>> 
>> In the transformation, cache space belongs to compiler-generated
>temporary
>> storage, but it is from user heap. We find this trick is very
>uncommon in gcc.
>> We'd like to make sure whether there are any special considerations
>or
>> restrictions on this. Anyway, using alloca() to get space from stack
>is an
>> alternative, but we should be cautious to avoid incurring stack
>overflow by
>> non-user logic.
>> 
>> "iter_count" stands for an upper bound for inner-loop iteration
>count, which
>> must be an outer-loop invariant expression, in that it determines
>data count
>> in cache space, and the space is allocated before outer-loop.
>> 
>> In the example below, we can simply get such bound for inner-loop 1
>from
>> its niter_desc, while for inner-loop 2, it is not that easy, but it
>is possible to
>> infer that the loop's iteration count never exceeds "n" via
>comparison
>> predicate information.
>> 
>>   foo (int n, int m)
>>   {
>>       for (int i = 0; i < n; i++ ) {      /* outer-loop 1 */
>>           ...
>>           for (int j = 0; j < m; j++) {   /* inner-loop 1 */
>>               ...
>>           }
>>           ...
>>       }
>> 
>>       for (int i = 0; i < n; i ++) {     /* outer-loop 2 */
>>           ...
>>           for (int j = 0; j < i; j++) {  /* inner-loop 2 */
>>               ...
>>           }
>>           ...
>>       }
>>   }
>> 
>> 3. Cache count threshold
>> 
>> We expect cache space works as normal stack temporary, and would not
>> impact execution of program so much, especially when heap is heavily
>used
>> in user code, and at least should not exhaust memory and make the
>program
>> broken. For this, a maximum cache count threshold needs to be
>introduced.
>> 
>> Since memory accesses are dynamically gathered at runtime, it has
>cost,
>> which lies in cache data filling, status check and space allocation.
>> Runtime overhead of MGO could outweigh performance benefit if cache
>data
>> count is too small, so we also need a minimum cache count threshold.
>> 
>> With a new runtime check against two cache data count thresholds, MGO
>> transformation will be:
>> 
>>   trans_ok = iter_count >= min_cache_data_count &&
>>              iter_count <= max_cache_data_count;
>> 
>>   if (trans_ok) {
>>       cache_arr = calloc (iter_count, sizeof(cache_elem));
>>   }
>> 
>>   outer-loop () {
>>       ...
>>       if (trans_ok)  {
>>           /* transformed inner-loop */
>>       }
>>       else {
>>           /* original inner-loop */
>>       }
>>   }
>> 
>>   if (trans_ok) {
>>       free (cache_arr);
>>   }
>> 
>> 4. Runtime alias check
>> 
>> In baseline MGO, memory disambiguation for candidate load is based on
>> static alias analysis, which always provides very conservative
>result.
>> A more aggressive option is to use runtime alias check, which could
>be easily
>> integrated into data-caching code. In the following pseudo code, a
>store
>> outside inner-loop is aliased with the candidate load, and it
>prevents the load
>> and its dependent loads from being optimized if we only apply the
>> optimization via static alias analysis.
>> 
>>   Type1 v0;
>> 
>>   outer-loop () {
>> 
>>       v0->field = value;  /* Aliased with v1->field */
>> 
>>       inner-loop (iter) {
>> 
>>           Type1 v1 = LOAD_FN1 (iter);
>>           Type2 v2 = v1->field;
>>           ...
>>       }
>>      ...
>>   }
>> 
>> 
>> When runtime alias check is enabled, MGO will generate code to
>> incrementally compute address range for candidate load, and check
>address
>> overlap for may-alias stores and the range at runtime. Once conflict
>is
>> detected, execution will switch to original inner-loop. With runtime
>alias
>> check added, transformation will be:
>> 
>>   Type1 v0;
>>   std::pair<void *, void *> v1_addr_range;
>> 
>>   outer-loop () {
>> 
>>       v0->field = value;
>>       ...
>>       /* Check whether v0 is covered by range.  */
>>       trans_ok &= !IN_RANGE (&v1_addr_range, v0);
>> 
>>       if (trans_ok) {
>> 
>>           inner-loop (iter) {
>> 
>>               if (!cache_arr[cache_idx]->init) {
>>                  Type1 v1 = LOAD_FN1 (iter);
>>                  Type2 v2 = v1->field;
>> 
>>                  /* Extend range to cover v1. */
>>                  UPDATE_RANGE (&v1_addr_range, v1);
>>                  ...
>>               }
>>               ...
>>           }
>>       }
>>       else {
>>          /* original inner-loop */
>>       }
>>       ...
>>   }
>> 
>> Currently, we are working on an implementation for this proposal. For
>sure,
>> it might be limited, so hope your comments on it. Thanks.
>> 
>> Best Regards,
>> 
>> Feng


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] A memory gathering optimization for loop
  2021-01-17 11:32   ` Richard Biener
@ 2021-01-18  3:15     ` Feng Xue OS
  0 siblings, 0 replies; 9+ messages in thread
From: Feng Xue OS @ 2021-01-18  3:15 UTC (permalink / raw)
  To: JiangNing OS, gcc, Richard Biener

>> -----Original Message-----
>> From: Feng Xue OS <fxue@os.amperecomputing.com>
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS <jiangning@os.amperecomputing.com>; Hao Liu OS
>> <hliu@os.amperecomputing.com>
>> Subject: [RFC] A memory gathering optimization for loop
>>
>> 1. Background
>>
>> In a loop, it is optimal if only one memory stream is activated, that is, all
>> memory operations sequentially access one data region. But that is always
>> not the case, such as traversing link list and manipulating discrete arrays. In
>> this scenario, the loop would contain multiple scattered memory accesses,
>> which could trigger inefficient multi-way hardware prefetching, thus making
>> cpu cache hit rate drop. The tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598)
>> gives a description on one related problem that we are meant to target.
>>
>> 2. Memory gathering optimization (MGO)
>>
>> For nested loops, if scattered memory accesses inside inner loop remain
>> unchanged in each iteration of outer loop, we can dynamically gather result
>> data into a sequential memory region when the first access in the inner loop
>> takes place. This way the hardware prefetching can be reduced, and cpu
>> cache hit rate can be improved. We name this optimization MGO (memory
>> gathering optimization). An example is given as below, which is a little
>> different from pr98598, and could not be optimized by loop distribution.
>>
>>   struct A { int **p1; };
>>
>>   int foo (int n, struct A *array)
>>   {
>>       int sum = 0;
>>
>>       for (int i = 0; i < n; i++) {
>>           for (int j = 0; j < i; j++) {  /* iteration count is i */
>>               int **p1 = array[j].p1;
>>               int  *p2 = *p1;
>>
>>               sum += *p2;
>>           }
>>       }
>>
>>       return sum;
>>   }
>>
>>


> The number of memory streams in this loop is limited (practically 1), how would the transform be affected when issuing prefetches for array at a well placed spot?
>
> That is, what's the hazard we avoid (besides the obvious data dependence
> Shortening in case the load from array is in cache?)
>
> Richard.
>

In the example, evaluating "sum += *p2" at each iteration involves three
memory loads, and each of them forms a memory stream (so at least 3 in
total, not 1) since they touch three discrete regions. Spatial locality
only exists for the first load "array[j].p1"; some kind of prefetch,
either hardware or software, works well for it, while for the other two,
prefetching is of no use and the CPU cache cannot speculate where those
loads will go.
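
(Annotating the three loads in the example, for clarity:)

      int **p1 = array[j].p1;   /* load 1: sequential over array[], prefetch-friendly  */
      int  *p2 = *p1;           /* load 2: address depends on the loaded p1, scattered */
      sum += *p2;               /* load 3: address depends on the loaded p2, scattered */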

So the intention of MGO is to put those discontinuous and unpredictable
memory accesses together to form one new sequential memory stream, in
favor of the CPU cache.

And if we statically find a hazard due to a certain write dependence, MGO
will not be applied.

Thanks,
Feng

>> Although this example seems to be pretty artificial, the kind of data reuse in
>> nested loops is often seen in real applications, especially written by Fortran,
>> and C++ STL.
>>
>> To simplify explanation of this memory gathering optimization, pseudo code
>> on original nested loops is given as:
>>
>>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
>>       ...
>>       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>>
>>           Type1 v1 = LOAD_FN1 (iter);
>>           Type2 v2 = LOAD_FN2 (v1);
>>           Type3 v3 = LOAD_FN3 (v2);
>>           ...
>>           iter = NEXT_FN (iter);
>>       }
>>
>>      ...
>>   }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a result,
>> and is composed of a sequence of operations in which there is only one
>> memory load. It is capable of representing simple memory dereference like
>> "iter->field", or more complicated expression like "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
>> iterator is advanced. It is able to describe simple integer iterator, list iterator,
>> or even pure-function-based iterator on advanced stuff like hashset/tree.
>>
>> Then the code will be transformed to:
>>
>>   typedef struct cache_elem {
>>       bool   init;
>>       Type1  c_v1;
>>       Type2  c_v2;
>>       Type3  c_v3;
>>   } cache_elem;
>>
>>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>>   outer-loop () {
>>       ...
>>       size_t cache_idx = 0;
>>
>>       inner-loop (iter, iter_count) {
>>
>>           if (!cache_arr[cache_idx]->init) {
>>               v1 = LOAD_FN1 (iter);
>>               v2 = LOAD_FN2 (v1);
>>               v3 = LOAD_FN3 (v2);
>>
>>               cache_arr[cache_idx]->init = true;
>>               cache_arr[cache_idx]->c_v1 = v1;
>>               cache_arr[cache_idx]->c_v2 = v2;
>>               cache_arr[cache_idx]->c_v3 = v3;
>>           }
>>           else {
>>               v1 = cache_arr[cache_idx]->c_v1;
>>               v2 = cache_arr[cache_idx]->c_v2;
>>               v3 = cache_arr[cache_idx]->c_v3;
>>           }
>>           ...
>>           cache_idx++;
>>           iter = NEXT_FN (iter);
>>       }
>>       ...
>>   }
>>
>>   free (cache_arr);
>
> The example in PR#98598 preloads the data before the outer loop, while the
> transformation in this RFC lazily caches the data in the inner loop, which is to
> avoid introducing unnecessary data RW hazards.
>
>>
>> In the transformation, cache space belongs to compiler-generated temporary
>> storage, but it is from user heap. We find this trick is very uncommon in gcc.
>> We'd like to make sure whether there are any special considerations or
>> restrictions on this. Anyway, using alloca() to get space from stack is an
>> alternative, but we should be cautious to avoid incurring stack overflow by
>> non-user logic.
>>
>> "iter_count" stands for an upper bound for inner-loop iteration count, which
>> must be an outer-loop invariant expression, in that it determines data count
>> in cache space, and the space is allocated before outer-loop.
>>
>> In the example below, we can simply get such bound for inner-loop 1 from
>> its niter_desc, while for inner-loop 2, it is not that easy, but it is possible to
>> infer that the loop's iteration count never exceeds "n" via comparison
>> predicate information.
>>
>>   foo (int n, int m)
>>   {
>>       for (int i = 0; i < n; i++ ) {      /* outer-loop 1 */
>>           ...
>>           for (int j = 0; j < m; j++) {   /* inner-loop 1 */
>>               ...
>>           }
>>           ...
>>       }
>>
>>       for (int i = 0; i < n; i ++) {     /* outer-loop 2 */
>>           ...
>>           for (int j = 0; j < i; j++) {  /* inner-loop 2 */
>>               ...
>>           }
>>           ...
>>       }
>>   }
>>
>> 3. Cache count threshold
>>
>> We expect cache space works as normal stack temporary, and would not
>> impact execution of program so much, especially when heap is heavily used
>> in user code, and at least should not exhaust memory and make the program
>> broken. For this, a maximum cache count threshold needs to be introduced.
>>
>> Since memory accesses are dynamically gathered at runtime, it has cost,
>> which lies in cache data filling, status check and space allocation.
>> Runtime overhead of MGO could outweigh performance benefit if cache data
>> count is too small, so we also need a minimum cache count threshold.
>>
>> With a new runtime check against two cache data count thresholds, MGO
>> transformation will be:
>>
>>   trans_ok = iter_count >= min_cache_data_count &&
>>              iter_count <= max_cache_data_count;
>>
>>   if (trans_ok) {
>>       cache_arr = calloc (iter_count, sizeof(cache_elem));
>>   }
>>
>>   outer-loop () {
>>       ...
>>       if (trans_ok)  {
>>           /* transformed inner-loop */
>>       }
>>       else {
>>           /* original inner-loop */
>>       }
>>   }
>>
>>   if (trans_ok) {
>>       free (cache_arr);
>>   }
>>
>> 4. Runtime alias check
>>
>> In baseline MGO, memory disambiguation for candidate load is based on
>> static alias analysis, which always provides very conservative result.
>> A more aggressive option is to use runtime alias check, which could be easily
>> integrated into data-caching code. In the following pseudo code, a store
>> outside inner-loop is aliased with the candidate load, and it prevents the load
>> and its dependent loads from being optimized if we only apply the
>> optimization via static alias analysis.
>>
>>   Type1 v0;
>>
>>   outer-loop () {
>>
>>       v0->field = value;  /* Aliased with v1->field */
>>
>>       inner-loop (iter) {
>>
>>           Type1 v1 = LOAD_FN1 (iter);
>>           Type2 v2 = v1->field;
>>           ...
>>       }
>>      ...
>>   }
>>
>>
>> When runtime alias check is enabled, MGO will generate code to
>> incrementally compute address range for candidate load, and check address
>> overlap for may-alias stores and the range at runtime. Once conflict is
>> detected, execution will switch to original inner-loop. With runtime alias
>> check added, transformation will be:
>>
>>   Type1 v0;
>>   std::pair<void *, void *> v1_addr_range;
>>
>>   outer-loop () {
>>
>>       v0->field = value;
>>       ...
>>       /* Check whether v0 is covered by range.  */
>>       trans_ok &= !IN_RANGE (&v1_addr_range, v0);
>>
>>       if (trans_ok) {
>>
>>           inner-loop (iter) {
>>
>>               if (!cache_arr[cache_idx]->init) {
>>                  Type1 v1 = LOAD_FN1 (iter);
>>                  Type2 v2 = v1->field;
>>
>>                  /* Extend range to cover v1. */
>>                  UPDATE_RANGE (&v1_addr_range, v1);
>>                  ...
>>               }
>>               ...
>>           }
>>       }
>>       else {
>>          /* original inner-loop */
>>       }
>>       ...
>>   }
>>
>> Currently, we are working on an implementation for this proposal. For sure,
>> it might be limited, so hope your comments on it. Thanks.
>>
>> Best Regards,
>>
>> Feng
>>
>>
>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] A memory gathering optimization for loop
  2021-01-14  4:28 [RFC] A memory gathering optimization for loop Feng Xue OS
  2021-01-17  1:23 ` JiangNing OS
@ 2021-04-30  6:30 ` Florian Weimer
  2021-04-30 12:26   ` Richard Biener
  2021-05-07  3:12   ` Feng Xue OS
  1 sibling, 2 replies; 9+ messages in thread
From: Florian Weimer @ 2021-04-30  6:30 UTC (permalink / raw)
  To: Feng Xue OS via Gcc; +Cc: Feng Xue OS, JiangNing OS

* Feng Xue:

> To simplify explanation of this memory gathering optimization, pseudo
> code on original nested loops is given as:
>
>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
>       ...
>       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>
>           Type1 v1 = LOAD_FN1 (iter);
>           Type2 v2 = LOAD_FN2 (v1);
>           Type3 v3 = LOAD_FN3 (v2);
>           ... 
>           iter = NEXT_FN (iter);
>       }
>
>      ...
>   }
>
> "LOAD_FNx()" is a conceptual function that translates its argument to a
> result, and is composed of a sequence of operations in which there is
> only one memory load. It is capable of representing simple memory
> dereference like "iter->field", or more complicated expression like
> "array[iter * 2 + cst].field".
>
> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
> iterator is advanced. It is able to describe simple integer iterator,
> list iterator, or even pure-function-based iterator on advanced stuff
> like hashset/tree.
>
> Then the code will be transformed to:
>
>   typedef struct cache_elem {
>       bool   init;
>       Type1  c_v1;
>       Type2  c_v2;
>       Type3  c_v3;
>   } cache_elem;
>
>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>
>   outer-loop () {
>       ...
>       size_t cache_idx = 0;
>
>       inner-loop (iter, iter_count) {
>
>           if (!cache_arr[cache_idx]->init) {
>               v1 = LOAD_FN1 (iter);
>               v2 = LOAD_FN2 (v1);
>               v3 = LOAD_FN3 (v2);
>
>               cache_arr[cache_idx]->init = true;
>               cache_arr[cache_idx]->c_v1 = v1;
>               cache_arr[cache_idx]->c_v2 = v2;
>               cache_arr[cache_idx]->c_v3 = v3;
>           }
>           else {
>               v1 = cache_arr[cache_idx]->c_v1;
>               v2 = cache_arr[cache_idx]->c_v2;
>               v3 = cache_arr[cache_idx]->c_v3;
>           }
>           ...
>           cache_idx++;
>           iter = NEXT_FN (iter);
>       }
>       ...
>   }
>
>   free (cache_arr);

Why do you need the init field?  Isn't the array initialized
sequentially?

You need to keep the original loop and use it if calloc fails.  The
transformation is not valid in general because calloc is not an
async-signal-safe function, and GCC shouldn't introduce such calls if
they are not present in the original source code.  For
-fnon-call-exceptions, you need to introduce unwinding code that calls
free.

Can you change this optimization so that it can use a fixed-size buffer?
This would avoid all issues around calling calloc.  If iter_count can be
very large, allocating that much extra memory might not be beneficial
anyway.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] A memory gathering optimization for loop
  2021-04-30  6:30 ` Florian Weimer
@ 2021-04-30 12:26   ` Richard Biener
  2021-04-30 12:50     ` Florian Weimer
  2021-05-07  3:12   ` Feng Xue OS
  1 sibling, 1 reply; 9+ messages in thread
From: Richard Biener @ 2021-04-30 12:26 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Feng Xue OS via Gcc, JiangNing OS

On Fri, Apr 30, 2021 at 9:51 AM Florian Weimer via Gcc <gcc@gcc.gnu.org> wrote:
>
> * Feng Xue:
>
> > To simplify explanation of this memory gathering optimization, pseudo
> > code on original nested loops is given as:
> >
> >   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
> >       ...
> >       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
> >
> >           Type1 v1 = LOAD_FN1 (iter);
> >           Type2 v2 = LOAD_FN2 (v1);
> >           Type3 v3 = LOAD_FN3 (v2);
> >           ...
> >           iter = NEXT_FN (iter);
> >       }
> >
> >      ...
> >   }
> >
> > "LOAD_FNx()" is a conceptual function that translates its argument to a
> > result, and is composed of a sequence of operations in which there is
> > only one memory load. It is capable of representing simple memory
> > dereference like "iter->field", or more complicated expression like
> > "array[iter * 2 + cst].field".
> >
> > Likewise, "NEXT_FN()" is also a conceptual function which defines how an
> > iterator is advanced. It is able to describe simple integer iterator,
> > list iterator, or even pure-function-based iterator on advanced stuff
> > like hashset/tree.
> >
> > Then the code will be transformed to:
> >
> >   typedef struct cache_elem {
> >       bool   init;
> >       Type1  c_v1;
> >       Type2  c_v2;
> >       Type3  c_v3;
> >   } cache_elem;
> >
> >   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
> >
> >   outer-loop () {
> >       ...
> >       size_t cache_idx = 0;
> >
> >       inner-loop (iter, iter_count) {
> >
> >           if (!cache_arr[cache_idx]->init) {
> >               v1 = LOAD_FN1 (iter);
> >               v2 = LOAD_FN2 (v1);
> >               v3 = LOAD_FN3 (v2);
> >
> >               cache_arr[cache_idx]->init = true;
> >               cache_arr[cache_idx]->c_v1 = v1;
> >               cache_arr[cache_idx]->c_v2 = v2;
> >               cache_arr[cache_idx]->c_v3 = v3;
> >           }
> >           else {
> >               v1 = cache_arr[cache_idx]->c_v1;
> >               v2 = cache_arr[cache_idx]->c_v2;
> >               v3 = cache_arr[cache_idx]->c_v3;
> >           }
> >           ...
> >           cache_idx++;
> >           iter = NEXT_FN (iter);
> >       }
> >       ...
> >   }
> >
> >   free (cache_arr);
>
> Why do you need the init field?  Isn't the array initialized
> sequentially?
>
> You need to keep the original loop and use it if calloc fails.  The
> transformation is not valid in general because calloc is not an
> async-signal-safe function, and GCC shouldn't introduce such calls if
> they are not present in the original source code.  For
> -fnon-call-exceptions, you need to introduce unwinding code that calls
> free.
>
> Can you change this optimization so that it can use a fixed-size buffer?
> This would avoid all issues around calling calloc.  If iter_count can be
> very large, allocating that much extra memory might not be beneficial
> anyway.

It would need to be TLS storage though or protected against concurrent
access (maybe an atomic is enough if we drop to the original loop when
concurrent access is detected).
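
(For illustration only, a rough sketch of such a fixed-size, thread-local
variant applied to the earlier foo() example; the names MGO_CACHE_MAX and
mgo_example are hypothetical, and __thread is the GNU thread-local storage
extension:)

  #include <stdbool.h>
  #include <string.h>

  struct A { int **p1; };

  enum { MGO_CACHE_MAX = 1024 };     /* hypothetical compile-time limit */

  typedef struct cache_elem { bool init; int *c_v2; int c_v3; } cache_elem;

  /* One cache per thread: no calloc call, no sharing between threads.  */
  static __thread cache_elem mgo_cache[MGO_CACHE_MAX];

  int mgo_example (int n, struct A *array)
  {
      bool trans_ok = n > 1 && n - 1 <= MGO_CACHE_MAX;
      int sum = 0;

      if (trans_ok)   /* clear the init flags for this use of the buffer */
          memset (mgo_cache, 0, (size_t) (n - 1) * sizeof (cache_elem));

      for (int i = 0; i < n; i++)
          for (int j = 0; j < i; j++) {
              if (trans_ok && mgo_cache[j].init)
                  sum += mgo_cache[j].c_v3;          /* use cached value */
              else {
                  int *p2 = *array[j].p1;
                  int  v3 = *p2;

                  if (trans_ok) {                    /* fill cache lazily */
                      mgo_cache[j].init = true;
                      mgo_cache[j].c_v2 = p2;
                      mgo_cache[j].c_v3 = v3;
                  }
                  sum += v3;
              }
          }

      return sum;
  }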

Richard.

> Thanks,
> Florian
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] A memory gathering optimization for loop
  2021-04-30 12:26   ` Richard Biener
@ 2021-04-30 12:50     ` Florian Weimer
  0 siblings, 0 replies; 9+ messages in thread
From: Florian Weimer @ 2021-04-30 12:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: Feng Xue OS via Gcc, JiangNing OS

* Richard Biener:

>> Can you change this optimization so that it can use a fixed-size buffer?
>> This would avoid all issues around calling calloc.  If iter_count can be
>> very large, allocating that much extra memory might not be beneficial
>> anyway.
>
> It would need to be TLS storage though or protected against concurrent
> access (maybe an atomic is enough if we drop to the original loop when
> concurrent access is detected).

If it can be constrained to a small window, then it can be allocated on
the stack.

But I suspect the point of the optimization is to do this data
reorganization when huge amounts of temporary memory are needed.  I
don't see a way to make this safe in general.

glibc's qsort implementation does something quite similar manually,
including the extra memory allocations, and we occasionally receive
complaints about it.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] A memory gathering optimization for loop
  2021-04-30  6:30 ` Florian Weimer
  2021-04-30 12:26   ` Richard Biener
@ 2021-05-07  3:12   ` Feng Xue OS
  1 sibling, 0 replies; 9+ messages in thread
From: Feng Xue OS @ 2021-05-07  3:12 UTC (permalink / raw)
  To: Florian Weimer, Feng Xue OS via Gcc; +Cc: JiangNing OS

>> To simplify explanation of this memory gathering optimization, pseudo
>> code on original nested loops is given as:
>>
>>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
>>       ...
>>       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>>
>>           Type1 v1 = LOAD_FN1 (iter);
>>           Type2 v2 = LOAD_FN2 (v1);
>>           Type3 v3 = LOAD_FN3 (v2);
>>           ...
>>           iter = NEXT_FN (iter);
>>       }
>>
>>      ...
>>   }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a
>> result, and is composed of a sequence of operations in which there is
>> only one memory load. It is capable of representing simple memory
>> dereference like "iter->field", or more complicated expression like
>> "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
>> iterator is advanced. It is able to describe simple integer iterator,
>> list iterator, or even pure-function-based iterator on advanced stuff
>> like hashset/tree.
>>
>> Then the code will be transformed to:
>>
>>   typedef struct cache_elem {
>>       bool   init;
>>       Type1  c_v1;
>>       Type2  c_v2;
>>       Type3  c_v3;
>>   } cache_elem;
>>
>>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>>   outer-loop () {
>>       ...
>>       size_t cache_idx = 0;
>>
>>       inner-loop (iter, iter_count) {
>>
>>           if (!cache_arr[cache_idx]->init) {
>>               v1 = LOAD_FN1 (iter);
>>               v2 = LOAD_FN2 (v1);
>>               v3 = LOAD_FN3 (v2);
>>
>>               cache_arr[cache_idx]->init = true;
>>               cache_arr[cache_idx]->c_v1 = v1;
>>               cache_arr[cache_idx]->c_v2 = v2;
>>               cache_arr[cache_idx]->c_v3 = v3;
>>           }
>>           else {
>>               v1 = cache_arr[cache_idx]->c_v1;
>>               v2 = cache_arr[cache_idx]->c_v2;
>>               v3 = cache_arr[cache_idx]->c_v3;
>>           }
>>           ...
>>           cache_idx++;
>>           iter = NEXT_FN (iter);
>>       }
>>       ...
>>   }
>>
>>   free (cache_arr);
> 
> Why do you need the init field?  Isn't the array initialized
> sequentially?

This is meant to support a more complicated pattern, in which the actual
iteration count of the inner loop depends on the outer loop, and
"iter_count" only represents an upper bound on that count.

for (i = 0; i < N; i++)
  {
    for (j = 0; j < i; j++) 
       {
           Type1 v1 = LOAD_FN1 (j);
           Type2 v2 = LOAD_FN2 (v1);
           Type3 v3 = LOAD_FN3 (v2);

           ...

           condition = ...
       }

    if (condition)
      break;
  }

In the above case, "iter_count" is N, which is not the exact iteration
count of the inner loop. Loop execution might end on a certain condition
before going through all loads. We should not cache all of them in one
step, since some loads might be invalid if the condition is not met. So
we need an "init" field to track whether a cache element has been
properly filled; it allows a runtime switch between "do caching" and
"use caching".

> You need to keep the original loop and use it if calloc fails.  The
> transformation is not valid in general because calloc is not an
> async-signal-safe function, and GCC shouldn't introduce such calls if
> they are not present in the original source code.  For
> -fnon-call-exceptions, you need to introduce unwinding code that calls
> free.

This is an issue that we missed. The only way left is to use alloca(),
but that may not be suitable for caching large amounts of data.

> Can you change this optimization so that it can use a fixed-size buffer?
> This would avoid all issues around calling calloc.  If iter_count can be
> very large, allocating that much extra memory might not be beneficial
> anyway.

And we would need additional code to synchronize access to this buffer.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC] A memory gathering optimization for loop
@ 2021-01-14  2:40 Feng Xue OS
  0 siblings, 0 replies; 9+ messages in thread
From: Feng Xue OS @ 2021-01-14  2:40 UTC (permalink / raw)
  To: Alan Modra via Gcc; +Cc: JiangNing OS, Hao Liu OS

1. Background

In a loop, it is optimal if only one memory stream is activated, that
is, if all memory operations sequentially access one data region. But
that is not always the case, for example when traversing a linked list
or manipulating discrete arrays. In such scenarios, the loop contains
multiple scattered memory accesses, which could trigger inefficient
multi-way hardware prefetching and thus cause the CPU cache hit rate to
drop. The tracker pr98598
(https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one
related problem that we intend to target.

2. Memory gathering optimization (MGO)

For nested loops, if scattered memory accesses inside the inner loop
remain unchanged across iterations of the outer loop, we can dynamically
gather the result data into a sequential memory region when the first
access in the inner loop takes place. This way, hardware prefetching
pressure can be reduced and the CPU cache hit rate improved. We name this
optimization MGO (memory gathering optimization). An example is given
below; it is slightly different from pr98598 and cannot be optimized by
loop distribution.

  struct A { int **p1; };

  int foo (int n, struct A *array)
  {
      int sum = 0;

      for (int i = 0; i < n; i++) {
          for (int j = 0; j < i; j++) {  /* iteration count is i */
              int **p1 = array[j].p1;
              int  *p2 = *p1;

              sum += *p2;
          }
      }

      return sum;
  }

Although this example may seem rather artificial, this kind of data
reuse in nested loops is often seen in real applications, especially
those written in Fortran or using the C++ STL.

To simplify the explanation of this memory gathering optimization,
pseudo code for the original nested loops is given as:

  outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
      ...
      inner-loop (iter, iter_count) {  /* inner loop to apply MGO */

          Type1 v1 = LOAD_FN1 (iter);
          Type2 v2 = LOAD_FN2 (v1);
          Type3 v3 = LOAD_FN3 (v2);
          ... 
          iter = NEXT_FN (iter);
      }

     ...
  }

"LOAD_FNx()" is a conceptual function that translates its argument to a
result, and is composed of a sequence of operations in which there is
only one memory load. It can represent a simple memory dereference like
"iter->field", or a more complicated expression like
"array[iter * 2 + cst].field".

Likewise, "NEXT_FN()" is a conceptual function that defines how an
iterator is advanced. It can describe a simple integer iterator, a list
iterator, or even a pure-function-based iterator over more advanced
structures such as a hashset or tree.

Then the code will be transformed to:

  typedef struct cache_elem {
      bool   init;
      Type1  c_v1;
      Type2  c_v2;
      Type3  c_v3;
  } cache_elem;

  cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));

  outer-loop () {
      ...
      size_t cache_idx = 0;

      inner-loop (iter, iter_count) {

          if (!cache_arr[cache_idx].init) {
              v1 = LOAD_FN1 (iter);
              v2 = LOAD_FN2 (v1);
              v3 = LOAD_FN3 (v2);

              cache_arr[cache_idx].init = true;
              cache_arr[cache_idx].c_v1 = v1;
              cache_arr[cache_idx].c_v2 = v2;
              cache_arr[cache_idx].c_v3 = v3;
          }
          else {
              v1 = cache_arr[cache_idx].c_v1;
              v2 = cache_arr[cache_idx].c_v2;
              v3 = cache_arr[cache_idx].c_v3;
          }
          ...
          cache_idx++;
          iter = NEXT_FN (iter);
      }
      ...
  }

  free (cache_arr);

In this transformation, the cache space is compiler-generated temporary
storage, but it comes from the user heap. We find this trick to be very
uncommon in gcc, so we would like to know whether there are any special
considerations or restrictions on it. Using alloca() to obtain space from
the stack is an alternative, but then we must be careful not to incur a
stack overflow through non-user logic.

"iter_count" stands for an upper bound for inner-loop iteration count,
which must be an outer-loop-invariant expression, since it determines the
number of data elements in the cache space, and the space is allocated
before the outer loop.

In the example below, we can simply get such a bound for inner-loop 1
from its niter_desc. For inner-loop 2 it is not that easy, but it is
still possible to infer from comparison predicate information that the
loop's iteration count never exceeds "n".

  foo (int n, int m)
  {
      for (int i = 0; i < n; i++ ) {      /* outer-loop 1 */
          ...
          for (int j = 0; j < m; j++) {   /* inner-loop 1 */
              ...
          }
          ...
      }

      for (int i = 0; i < n; i ++) {     /* outer-loop 2 */
          ...
          for (int j = 0; j < i; j++) {  /* inner-loop 2 */
              ...
          }
          ...
      }
  }

3. Cache count threshold

We expect the cache space to behave like a normal stack temporary and not
noticeably impact program execution, especially when the heap is heavily
used in user code; at the very least, it must not exhaust memory and break
the program. For this, a maximum cache count threshold needs to be
introduced.

Since memory accesses are gathered dynamically at runtime, the
transformation has a cost, which lies in cache data filling, status checks
and space allocation. The runtime overhead of MGO could outweigh its
performance benefit if the cache data count is too small, so we also need
a minimum cache count threshold.

With a new runtime check against these two cache data count thresholds,
the MGO transformation becomes:

  trans_ok = iter_count >= min_cache_data_count &&
             iter_count <= max_cache_data_count;

  if (trans_ok) {
      cache_arr = calloc (iter_count, sizeof(cache_elem));
  }

  outer-loop () {
      ...
      if (trans_ok)  {
          /* transformed inner-loop */
      }
      else {
          /* original inner-loop */
      }
  }

  if (trans_ok) {
      free (cache_arr);
  }

4. Runtime alias check

In baseline MGO, memory disambiguation for a candidate load is based on
static alias analysis, which tends to give very conservative results.
A more aggressive option is a runtime alias check, which can easily be
integrated into the data-caching code. In the following pseudo code, a
store outside the inner loop is aliased with the candidate load; if we
apply the optimization based on static alias analysis alone, it prevents
the load and its dependent loads from being optimized.

  Type1 v0;

  outer-loop () {

      v0->field = value;  /* Aliased with v1->field */

      inner-loop (iter) {

          Type1 v1 = LOAD_FN1 (iter);
          Type2 v2 = v1->field;
          ...
      }
     ...
  }


When the runtime alias check is enabled, MGO generates code to
incrementally compute the address range touched by the candidate load,
and to check at runtime whether any may-alias store overlaps that range.
Once a conflict is detected, execution switches to the original inner
loop. With the runtime alias check added, the transformation becomes:

  Type1 v0;
  std::pair<void *, void *> v1_addr_range;

  outer-loop () {

      v0->field = value;
      ...
      /* Check whether v0 is covered by range.  */
      trans_ok &= !IN_RANGE (&v1_addr_range, v0);

      if (trans_ok) {

          inner-loop (iter) {

              if (!cache_arr[cache_idx].init) {
                 Type1 v1 = LOAD_FN1 (iter);
                 Type2 v2 = v1->field;

                 /* Extend range to cover v1. */
                 UPDATE_RANGE (&v1_addr_range, v1);
                 ...
              }
              ...
          }
      }
      else {
         /* original inner-loop */
      }
      ...
  }

Currently, we are working on an implementation of this proposal. It will
surely have limitations, so we hope for your comments on it. Thanks.

Best Regards,

Feng

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-05-07  3:12 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-14  4:28 [RFC] A memory gathering optimization for loop Feng Xue OS
2021-01-17  1:23 ` JiangNing OS
2021-01-17 11:32   ` Richard Biener
2021-01-18  3:15     ` Feng Xue OS
2021-04-30  6:30 ` Florian Weimer
2021-04-30 12:26   ` Richard Biener
2021-04-30 12:50     ` Florian Weimer
2021-05-07  3:12   ` Feng Xue OS
  -- strict thread matches above, loose matches on Subject: below --
2021-01-14  2:40 Feng Xue OS

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).