Hello world,

this is a draft patch which interchanges the indices for FORALL and
DO CONCURRENT loops for cases like PR 82471, where code like

   DO CONCURRENT( K=1:N, J=1:M, I=1:L)
      C(I,J,K) = A(I,J,K) + B(I,J,K)
   END DO

led to very poor code because of stride issues.  Currently,
Graphite is not able to do this.

Without the patch, the code above is translated as


         i.7 = 1;
         count.10 = 512;
         while (1)
           {
             if (ANNOTATE_EXPR <count.10 <= 0, ivdep>) goto L.4;
             j.6 = 1;
             count.9 = 512;
             while (1)
               {
                 if (ANNOTATE_EXPR <count.9 <= 0, ivdep>) goto L.3;
                 k.5 = 1;
                 count.8 = 512;
                 while (1)
                   {
                     if (ANNOTATE_EXPR <count.8 <= 0, ivdep>) goto L.2;
                     (*(real(kind=4)[0:] * restrict) c.data)[((c.offset 
+ (integer(kind=8)) k.5 * c.dim[2].stride) + (integer(kind=8)) j.6 * 
c.dim[1].stride) + (integer(kind=8)) i.7] = (*(real(kind=4)[0:] * 
restrict) a.data)[((a.offset + (integer(kind=8)) k.5 * a.dim[2].stride) 
+ (integer(kind=8)) j.6 * a.dim[1].stride) + (integer(kind=8)) i.7] + 
(*(real(kind=4)[0:] * restrict) b.data)[((b.offset + (integer(kind=8)) 
k.5 * b.dim[2].stride) + (integer(kind=8)) j.6 * b.dim[1].stride) + 
(integer(kind=8)) i.7];
                     L.1:;
                     k.5 = k.5 + 1;
                     count.8 = count.8 + -1;
                   }
                 L.2:;
                 j.6 = j.6 + 1;
                 count.9 = count.9 + -1;
               }
             L.3:;
             i.7 = i.7 + 1;
             count.10 = count.10 + -1;
           }
         L.4:;

so the innermost loop has the biggest stride. With the patch
(and front-end optimization turned on), this is turned into

         k.7 = 1;
         count.10 = 512;
         while (1)
           {
             if (ANNOTATE_EXPR <count.10 <= 0, ivdep>) goto L.4;
             j.6 = 1;
             count.9 = 512;
             while (1)
               {
                 if (ANNOTATE_EXPR <count.9 <= 0, ivdep>) goto L.3;
                 i.5 = 1;
                 count.8 = 512;
                 while (1)
                   {
                     if (ANNOTATE_EXPR <count.8 <= 0, ivdep>) goto L.2;
                     (*(real(kind=4)[0:] * restrict) c.data)[((c.offset 
+ (integer(kind=8)) k.7 * c.dim[2].stride) + (integer(kind=8)) j.6 * 
c.dim[1].stride) + (integer(kind=8)) i.5] = (*(real(kind=4)[0:] * 
restrict) a.data)[((a.offset + (integer(kind=8)) k.7 * a.dim[2].stride) 
+ (integer(kind=8)) j.6 * a.dim[1].stride) + (integer(kind=8)) i.5] + 
(*(real(kind=4)[0:] * restrict) b.data)[((b.offset + (integer(kind=8)) 
k.7 * b.dim[2].stride) + (integer(kind=8)) j.6 * b.dim[1].stride) + 
(integer(kind=8)) i.5];
                     L.1:;
                     i.5 = i.5 + 1;
                     count.8 = count.8 + -1;
                   }
                 L.2:;
                 j.6 = j.6 + 1;
                 count.9 = count.9 + -1;
               }
             L.3:;
             k.7 = k.7 + 1;
             count.10 = count.10 + -1;
           }
         L.4:;

so the innermost loop is the one that gets traversed with unity stride
(the way it should have been done).

Although the algorithm here is quite simple, it resolves the issues
raised in the PR so far, and it definitely worth fixing.

If we do this kind of thing, it might also be possible to
interchange normal DO loops in a similar way (which Graphite also
cannot do at the moment, at least not if the bounds are variables).

So, comments? Suggestions? Other ideas? Any ideas how to write
a test case? Any volunteers to re-implement Graphite in the
Fortran front end (didn't think so) or to make Graphite catch
this sort of pattern (which it currently doesn't) instead?

Regards

	Thomas

2017-10-27  Thomas Koenig  <tkoenig@gcc.gnu.org>

         * frontend-passes.c (index_interchange): New funciton,
         prototype.
         (optimize_namespace): Call index_interchange.
         (ind_type): New function.
         (has_var): New function.
         (index_cost): New function.
         (loop_comp): New function.