public inbox for gcc@gcc.gnu.org
* [RFC] A memory gathering optimization for loop
@ 2021-01-14  4:28 Feng Xue OS
  2021-01-17  1:23 ` JiangNing OS
  2021-04-30  6:30 ` Florian Weimer
  0 siblings, 2 replies; 9+ messages in thread
From: Feng Xue OS @ 2021-01-14  4:28 UTC (permalink / raw)
  To: gcc; +Cc: JiangNing OS, Hao Liu OS

1. Background

In a loop, it is optimal if only one memory stream is activated, that
is, if all memory operations sequentially access one data region. But
that is not always the case, for example when traversing a linked list
or manipulating discrete arrays. In such a scenario, the loop contains
multiple scattered memory accesses, which could trigger inefficient
multi-way hardware prefetching and thus make the CPU cache hit rate
drop. The tracker pr98598
(https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one
related problem that we intend to target.

2. Memory gathering optimization (MGO)

For nested loops, if the scattered memory accesses inside the inner
loop remain unchanged across iterations of the outer loop, we can
dynamically gather the loaded data into a sequential memory region when
the first access in the inner loop takes place. This way, hardware
prefetching pressure can be reduced and the CPU cache hit rate
improved. We name this optimization MGO (memory gathering
optimization). An example is given below; it is a little different from
pr98598 and could not be optimized by loop distribution.

  struct A { int **p1; };

  int foo (int n, struct A *array)
  {
      int sum = 0;

      for (int i = 0; i < n; i++) {
          for (int j = 0; j < i; j++) {  /* iteration count is i */
              int **p1 = array[j].p1;
              int  *p2 = *p1;

              sum += *p2;
          }
      }

      return sum;
  }

Although this example may seem pretty artificial, this kind of data
reuse in nested loops is often seen in real applications, especially
those written in Fortran or using the C++ STL.

To simplify the explanation of this memory gathering optimization,
pseudo code for the original nested loops is given as:

  outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
      ...
      inner-loop (iter, iter_count) {  /* inner loop to apply MGO */

          Type1 v1 = LOAD_FN1 (iter);
          Type2 v2 = LOAD_FN2 (v1);
          Type3 v3 = LOAD_FN3 (v2);
          ... 
          iter = NEXT_FN (iter);
      }

     ...
  }

"LOAD_FNx()" is a conceptual function that translates its argument to a
result, and is composed of a sequence of operations containing exactly
one memory load. It can represent a simple memory dereference like
"iter->field", or a more complicated expression like
"array[iter * 2 + cst].field".

Likewise, "NEXT_FN()" is a conceptual function that defines how an
iterator is advanced. It can describe a simple integer iterator, a list
iterator, or even a pure-function-based iterator over advanced
structures like a hashset or tree.
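As a concrete (hypothetical) instance of these conceptual functions,
here is our reading of how a simple linked-list traversal maps onto the
LOAD_FNx/NEXT_FN form; the node layout and names are illustrative only:

```c
#include <stddef.h>

/* A hypothetical linked-list node; names are illustrative only.  */
struct node {
    struct node *next;
    int         *data;
};

/* LOAD_FN1: exactly one memory load, "iter->data".  */
static int *load_fn1 (struct node *iter) { return iter->data; }

/* LOAD_FN2: exactly one memory load, "*v1".  */
static int load_fn2 (int *v1) { return *v1; }

/* NEXT_FN: advances the iterator, here a simple pointer chase.  */
static struct node *next_fn (struct node *iter) { return iter->next; }

/* The inner loop then reads as a chain of LOAD_FNx calls plus NEXT_FN.  */
static int sum_list (struct node *iter)
{
    int sum = 0;
    while (iter) {
        int *v1 = load_fn1 (iter);
        int  v2 = load_fn2 (v1);
        sum += v2;
        iter = next_fn (iter);
    }
    return sum;
}
```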

Then the code will be transformed to:

  typedef struct cache_elem {
      bool   init;
      Type1  c_v1;
      Type2  c_v2;
      Type3  c_v3;
  } cache_elem;

  cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));

  outer-loop () {
      ...
      size_t cache_idx = 0;

      inner-loop (iter, iter_count) {

          if (!cache_arr[cache_idx].init) {
              v1 = LOAD_FN1 (iter);
              v2 = LOAD_FN2 (v1);
              v3 = LOAD_FN3 (v2);

              cache_arr[cache_idx].init = true;
              cache_arr[cache_idx].c_v1 = v1;
              cache_arr[cache_idx].c_v2 = v2;
              cache_arr[cache_idx].c_v3 = v3;
          }
          else {
              v1 = cache_arr[cache_idx].c_v1;
              v2 = cache_arr[cache_idx].c_v2;
              v3 = cache_arr[cache_idx].c_v3;
          }
          ...
          cache_idx++;
          iter = NEXT_FN (iter);
      }
      ...
  }

  free (cache_arr);
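For concreteness, here is a hand-written sketch of what the transformed
version of the earlier foo example could look like. This is our own
illustration, not GCC output; in particular the trip-count bound
n - 1 (from j < i < n) is derived by hand:

```c
#include <stdbool.h>
#include <stdlib.h>

struct A { int **p1; };

typedef struct cache_elem {
    bool   init;
    int  **c_v1;   /* cached array[j].p1 */
    int   *c_v2;   /* cached *p1 */
    int    c_v3;   /* cached *p2 */
} cache_elem;

int foo_mgo (int n, struct A *array)
{
    int sum = 0;
    /* Upper bound on the inner-loop trip count: j < i < n.  */
    size_t iter_count = n > 1 ? (size_t) (n - 1) : 1;
    cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

    for (int i = 0; i < n; i++) {
        size_t cache_idx = 0;
        for (int j = 0; j < i; j++) {
            cache_elem *ce = &cache_arr[cache_idx];
            if (!ce->init) {
                /* First visit: perform the real loads and fill the cache.  */
                int **p1 = array[j].p1;
                int  *p2 = *p1;
                ce->init = true;
                ce->c_v1 = p1;
                ce->c_v2 = p2;
                ce->c_v3 = *p2;
            }
            /* Later visits read the sequential cache instead.  */
            sum += ce->c_v3;
            cache_idx++;
        }
    }
    free (cache_arr);
    return sum;
}
```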

In this transformation, the cache space is compiler-generated temporary
storage, but it comes from the user heap. We find this trick is very
uncommon in GCC, so we would like to know whether there are any special
considerations or restrictions on it. Alternatively, alloca() could be
used to obtain space from the stack, but we would have to be careful
not to incur a stack overflow through non-user logic.

"iter_count" stands for an upper bound on the inner-loop iteration
count. It must be an outer-loop invariant expression, since it
determines the data count of the cache space, and the space is
allocated before the outer loop.

In the example below, we can simply get such a bound for inner-loop 1
from its niter_desc. For inner-loop 2 it is not that easy, but it is
possible to infer from comparison predicate information that the loop's
iteration count never exceeds "n".

  foo (int n, int m)
  {
      for (int i = 0; i < n; i++ ) {      /* outer-loop 1 */
          ...
          for (int j = 0; j < m; j++) {   /* inner-loop 1 */
              ...
          }
          ...
      }

      for (int i = 0; i < n; i ++) {     /* outer-loop 2 */
          ...
          for (int j = 0; j < i; j++) {  /* inner-loop 2 */
              ...
          }
          ...
      }
  }

3. Cache count threshold

We expect the cache space to behave like a normal stack temporary and
not impact program execution too much, especially when the heap is
heavily used in user code; at the very least it should not exhaust
memory and break the program. For this, a maximum cache count threshold
needs to be introduced.

Since memory accesses are gathered dynamically, this has a runtime
cost, which lies in cache data filling, status checking and space
allocation. The runtime overhead of MGO could outweigh its performance
benefit if the cache data count is too small, so we also need a minimum
cache count threshold.

With a new runtime check against these two cache data count thresholds,
the MGO transformation becomes:

  trans_ok = iter_count >= min_cache_data_count &&
             iter_count <= max_cache_data_count;

  if (trans_ok) {
      cache_arr = calloc (iter_count, sizeof(cache_elem));
  }

  outer-loop () {
      ...
      if (trans_ok)  {
          /* transformed inner-loop */
      }
      else {
          /* original inner-loop */
      }
  }

  if (trans_ok) {
      free (cache_arr);
  }
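The threshold check above can be sketched in C as follows; the values
64 and 1 << 20 are placeholders we made up for illustration, not actual
GCC parameters:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical thresholds; real values would be compiler parameters.  */
#define MIN_CACHE_DATA_COUNT 64
#define MAX_CACHE_DATA_COUNT ((size_t) 1 << 20)

/* Decide whether the MGO-transformed inner loop should run for a given
   bound on the inner-loop iteration count.  Too few iterations and the
   caching overhead dominates; too many and the cache space itself could
   exhaust memory.  */
static bool mgo_trans_ok (size_t iter_count)
{
    return iter_count >= MIN_CACHE_DATA_COUNT
           && iter_count <= MAX_CACHE_DATA_COUNT;
}
```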

4. Runtime alias check

In baseline MGO, memory disambiguation for a candidate load is based on
static alias analysis, which often gives a very conservative result. A
more aggressive option is a runtime alias check, which can easily be
integrated into the data-caching code. In the following pseudo code, a
store outside the inner loop aliases the candidate load, and this
prevents the load and its dependent loads from being optimized if we
apply the optimization based on static alias analysis alone.

  Type1 v0;

  outer-loop () {

      v0->field = value;  /* Aliased with v1->field */

      inner-loop (iter) {

          Type1 v1 = LOAD_FN1 (iter);
          Type2 v2 = v1->field;
          ...
      }
     ...
  }


When the runtime alias check is enabled, MGO generates code that
incrementally computes the address range covered by the candidate load
and checks at runtime whether any may-alias store overlaps that range.
Once a conflict is detected, execution switches back to the original
inner loop. With the runtime alias check added, the transformation
becomes:

  Type1 v0;
  std::pair<void *, void *> v1_addr_range;

  outer-loop () {

      v0->field = value;
      ...
      /* Check whether v0 is covered by range.  */
      trans_ok &= !IN_RANGE (&v1_addr_range, v0);

      if (trans_ok) {

          inner-loop (iter) {

              if (!cache_arr[cache_idx].init) {
                 Type1 v1 = LOAD_FN1 (iter);
                 Type2 v2 = v1->field;

                 /* Extend range to cover v1. */
                 UPDATE_RANGE (&v1_addr_range, v1);
                 ...
              }
              ...
          }
      }
      else {
         /* original inner-loop */
      }
      ...
  }
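The IN_RANGE and UPDATE_RANGE primitives could be lowered to a simple
[lo, hi) interval kept as integers. This is our own sketch of one
possible implementation under that assumption, not GCC's generated
code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Tracks the address interval [lo, hi) touched by the candidate load.  */
typedef struct {
    uintptr_t lo;
    uintptr_t hi;
} addr_range;

/* Start with an empty interval that no address can fall into.  */
static void range_init (addr_range *r)
{
    r->lo = UINTPTR_MAX;
    r->hi = 0;
}

/* UPDATE_RANGE: extend the interval to cover [addr, addr + size).  */
static void update_range (addr_range *r, const void *addr, size_t size)
{
    uintptr_t a = (uintptr_t) addr;
    if (a < r->lo)
        r->lo = a;
    if (a + size > r->hi)
        r->hi = a + size;
}

/* IN_RANGE: does [addr, addr + size) overlap the recorded interval?
   A "true" result means a may-alias store conflicts with a cached load,
   so execution must fall back to the original inner loop.  */
static bool in_range (const addr_range *r, const void *addr, size_t size)
{
    uintptr_t a = (uintptr_t) addr;
    return a < r->hi && a + size > r->lo;
}
```

Keeping the range as a single conservative interval makes the per-store
check O(1), at the price of false conflicts when cached loads are far
apart in memory.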

Currently, we are working on an implementation of this proposal. It
surely has limitations, so we hope for your comments. Thanks.

Best Regards,

Feng


end of thread, other threads:[~2021-05-07  3:12 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-14  4:28 [RFC] A memory gathering optimization for loop Feng Xue OS
2021-01-17  1:23 ` JiangNing OS
2021-01-17 11:32   ` Richard Biener
2021-01-18  3:15     ` Feng Xue OS
2021-04-30  6:30 ` Florian Weimer
2021-04-30 12:26   ` Richard Biener
2021-04-30 12:50     ` Florian Weimer
2021-05-07  3:12   ` Feng Xue OS
  -- strict thread matches above, loose matches on Subject: below --
2021-01-14  2:40 Feng Xue OS
