* [patch,libgfortran] PR51119 - MATMUL slow for large matrices
@ 2016-11-14 0:09 Jerry DeLisle
2016-11-14 0:55 ` Steve Kargl
2016-11-14 7:04 ` Thomas Koenig
0 siblings, 2 replies; 11+ messages in thread
From: Jerry DeLisle @ 2016-11-14 0:09 UTC (permalink / raw)
To: fortran; +Cc: GCC Patches
[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]
Hi all,
Attached patch implements a fast blocked matrix multiply. The basic algorithm is
derived from netlib.org tuned blas dgemm. See matmul.m4 for reference.
The matmul() function is compiled with -Ofast -funroll-loops. This can be
customized further if there is an undesired optimization being used. This is
accomplished using #pragma optimize ( string ).
My results on 3.8 GHz machine with single core:
$ gfc -Ofast -funroll-loops -finline-matmul-limit=32 compare.f90
$ ./a.out
=========================================================
================ MEASURED GIGAFLOPS =
=========================================================
Matmul Matmul
fixed Matmul variable
Size Loops explicit refMatmul assumed explicit
=========================================================
2 2000 23.810 0.058 0.116 0.191
4 2000 1.979 0.294 0.437 0.421
8 2000 3.089 0.826 0.928 0.993
16 2000 4.115 3.262 2.600 3.381
32 2000 6.066 5.201 3.008 4.873
64 2000 6.596 4.847 6.624 6.603
128 2000 8.389 5.965 8.370 8.375
256 477 9.520 6.003 9.449 9.452
512 59 8.563 2.783 8.359 8.500
1024 7 8.672 1.537 8.457 8.604
2048 1 8.586 1.753 8.371 8.511
Results may vary, but I found 32 is the right place to set the limit on inlining.
Regression tested on x86-64-linux power8 (gcc112)
Special thanks to Thomas for helping me test and debug. An additional test case
as well.
OK for trunk?
Best regards,
Jerry
2016-11-09 Jerry DeLisle <jvdelisle@gcc.gnu.org>
Thomas Koenig <tkoenig@gcc.gnu.org>
PR libgfortran/51119
* m4/matmul.m4: For the case of all strides = 1, implement a
fast blocked matrix multiply. Fix some whitespace.
* generated/matmul_c10.c: Regenerate.
* generated/matmul_c16.c: Regenerate.
* generated/matmul_c4.c: Regenerate.
* generated/matmul_c8.c: Regenerate.
* generated/matmul_i1.c: Regenerate.
* generated/matmul_i16.c: Regenerate.
* generated/matmul_i2.c: Regenerate.
* generated/matmul_i4.c: Regenerate.
* generated/matmul_i8.c: Regenerate.
* generated/matmul_r10.c: Regenerate.
* generated/matmul_r16.c: Regenerate.
* generated/matmul_r4.c: Regenerate.
* generated/matmul_r8.c: Regenerate.
[-- Attachment #2: pr51119.diff --]
[-- Type: text/x-patch, Size: 220149 bytes --]
diff --git a/libgfortran/generated/matmul_c10.c b/libgfortran/generated/matmul_c10.c
index c955818..6d1985d 100644
--- a/libgfortran/generated/matmul_c10.c
+++ b/libgfortran/generated/matmul_c10.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_COMPLEX_10)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_c10 (gfc_array_c10 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_c10);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_c10 (gfc_array_c10 * const restrict retarray,
gfc_array_c10 * const restrict a, gfc_array_c10 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_c10 (gfc_array_c10 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_c10 (gfc_array_c10 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_COMPLEX_10));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_c10 (gfc_array_c10 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_COMPLEX_10 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_COMPLEX_10 * restrict bbase_y;
- GFC_COMPLEX_10 * restrict dest_y;
- const GFC_COMPLEX_10 * restrict abase_n;
- GFC_COMPLEX_10 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_COMPLEX_10 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_COMPLEX_10) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_COMPLEX_10)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_COMPLEX_10 *a, *b;
+ GFC_COMPLEX_10 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_COMPLEX_10 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_COMPLEX_10)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_c10 (gfc_array_c10 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_c10 (gfc_array_c10 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_c16.c b/libgfortran/generated/matmul_c16.c
index 25fe56e..d967483 100644
--- a/libgfortran/generated/matmul_c16.c
+++ b/libgfortran/generated/matmul_c16.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_COMPLEX_16)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_c16 (gfc_array_c16 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_c16);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_c16 (gfc_array_c16 * const restrict retarray,
gfc_array_c16 * const restrict a, gfc_array_c16 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_c16 (gfc_array_c16 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_c16 (gfc_array_c16 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_COMPLEX_16));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_c16 (gfc_array_c16 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_COMPLEX_16 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_COMPLEX_16 * restrict bbase_y;
- GFC_COMPLEX_16 * restrict dest_y;
- const GFC_COMPLEX_16 * restrict abase_n;
- GFC_COMPLEX_16 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_COMPLEX_16 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_COMPLEX_16) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_COMPLEX_16)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_COMPLEX_16 *a, *b;
+ GFC_COMPLEX_16 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_COMPLEX_16 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_COMPLEX_16)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_c16 (gfc_array_c16 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_c16 (gfc_array_c16 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_c4.c b/libgfortran/generated/matmul_c4.c
index e9d2ed3..0305782 100644
--- a/libgfortran/generated/matmul_c4.c
+++ b/libgfortran/generated/matmul_c4.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_COMPLEX_4)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_c4 (gfc_array_c4 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_c4);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_c4 (gfc_array_c4 * const restrict retarray,
gfc_array_c4 * const restrict a, gfc_array_c4 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_c4 (gfc_array_c4 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_c4 (gfc_array_c4 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_COMPLEX_4));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_c4 (gfc_array_c4 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_COMPLEX_4 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_COMPLEX_4 * restrict bbase_y;
- GFC_COMPLEX_4 * restrict dest_y;
- const GFC_COMPLEX_4 * restrict abase_n;
- GFC_COMPLEX_4 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_COMPLEX_4 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_COMPLEX_4) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_COMPLEX_4)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_COMPLEX_4 *a, *b;
+ GFC_COMPLEX_4 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_COMPLEX_4 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_COMPLEX_4)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_c4 (gfc_array_c4 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_c4 (gfc_array_c4 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_c8.c b/libgfortran/generated/matmul_c8.c
index 8a127da..efd5623 100644
--- a/libgfortran/generated/matmul_c8.c
+++ b/libgfortran/generated/matmul_c8.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_COMPLEX_8)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_c8 (gfc_array_c8 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_c8);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_c8 (gfc_array_c8 * const restrict retarray,
gfc_array_c8 * const restrict a, gfc_array_c8 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_c8 (gfc_array_c8 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_c8 (gfc_array_c8 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_COMPLEX_8));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_c8 (gfc_array_c8 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_COMPLEX_8 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_COMPLEX_8 * restrict bbase_y;
- GFC_COMPLEX_8 * restrict dest_y;
- const GFC_COMPLEX_8 * restrict abase_n;
- GFC_COMPLEX_8 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_COMPLEX_8 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_COMPLEX_8) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_COMPLEX_8)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_COMPLEX_8 *a, *b;
+ GFC_COMPLEX_8 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_COMPLEX_8 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_COMPLEX_8)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_c8 (gfc_array_c8 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_c8 (gfc_array_c8 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_i1.c b/libgfortran/generated/matmul_i1.c
index fdb3092..58d12ab 100644
--- a/libgfortran/generated/matmul_i1.c
+++ b/libgfortran/generated/matmul_i1.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_INTEGER_1)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_i1 (gfc_array_i1 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_i1);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_i1 (gfc_array_i1 * const restrict retarray,
gfc_array_i1 * const restrict a, gfc_array_i1 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_i1 (gfc_array_i1 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_i1 (gfc_array_i1 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_1));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_i1 (gfc_array_i1 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_INTEGER_1 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_INTEGER_1 * restrict bbase_y;
- GFC_INTEGER_1 * restrict dest_y;
- const GFC_INTEGER_1 * restrict abase_n;
- GFC_INTEGER_1 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_INTEGER_1 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_INTEGER_1) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_INTEGER_1)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_INTEGER_1 *a, *b;
+ GFC_INTEGER_1 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_INTEGER_1 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_INTEGER_1)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_i1 (gfc_array_i1 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_i1 (gfc_array_i1 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_i16.c b/libgfortran/generated/matmul_i16.c
index 80eb63c..bfcc028 100644
--- a/libgfortran/generated/matmul_i16.c
+++ b/libgfortran/generated/matmul_i16.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_INTEGER_16)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_i16 (gfc_array_i16 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_i16);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_i16 (gfc_array_i16 * const restrict retarray,
gfc_array_i16 * const restrict a, gfc_array_i16 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_i16 (gfc_array_i16 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_i16 (gfc_array_i16 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_16));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_i16 (gfc_array_i16 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_INTEGER_16 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_INTEGER_16 * restrict bbase_y;
- GFC_INTEGER_16 * restrict dest_y;
- const GFC_INTEGER_16 * restrict abase_n;
- GFC_INTEGER_16 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_INTEGER_16 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_INTEGER_16) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_INTEGER_16)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_INTEGER_16 *a, *b;
+ GFC_INTEGER_16 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_INTEGER_16 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_INTEGER_16)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_i16 (gfc_array_i16 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_i16 (gfc_array_i16 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_i2.c b/libgfortran/generated/matmul_i2.c
index 281a013..a65c40b 100644
--- a/libgfortran/generated/matmul_i2.c
+++ b/libgfortran/generated/matmul_i2.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_INTEGER_2)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_i2 (gfc_array_i2 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_i2);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_i2 (gfc_array_i2 * const restrict retarray,
gfc_array_i2 * const restrict a, gfc_array_i2 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_i2 (gfc_array_i2 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_i2 (gfc_array_i2 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_2));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_i2 (gfc_array_i2 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_INTEGER_2 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_INTEGER_2 * restrict bbase_y;
- GFC_INTEGER_2 * restrict dest_y;
- const GFC_INTEGER_2 * restrict abase_n;
- GFC_INTEGER_2 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_INTEGER_2 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_INTEGER_2) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_INTEGER_2)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_INTEGER_2 *a, *b;
+ GFC_INTEGER_2 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_INTEGER_2 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_INTEGER_2)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_i2 (gfc_array_i2 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_i2 (gfc_array_i2 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_i4.c b/libgfortran/generated/matmul_i4.c
index 2dc526d..933f8d5 100644
--- a/libgfortran/generated/matmul_i4.c
+++ b/libgfortran/generated/matmul_i4.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_INTEGER_4)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_i4 (gfc_array_i4 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_i4);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_i4 (gfc_array_i4 * const restrict retarray,
gfc_array_i4 * const restrict a, gfc_array_i4 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_i4 (gfc_array_i4 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_i4 (gfc_array_i4 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_4));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_i4 (gfc_array_i4 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_INTEGER_4 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_INTEGER_4 * restrict bbase_y;
- GFC_INTEGER_4 * restrict dest_y;
- const GFC_INTEGER_4 * restrict abase_n;
- GFC_INTEGER_4 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_INTEGER_4 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_INTEGER_4) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_INTEGER_4)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_INTEGER_4 *a, *b;
+ GFC_INTEGER_4 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_INTEGER_4 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_INTEGER_4)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_i4 (gfc_array_i4 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_i4 (gfc_array_i4 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_i8.c b/libgfortran/generated/matmul_i8.c
index 0ff728d..62f82b9 100644
--- a/libgfortran/generated/matmul_i8.c
+++ b/libgfortran/generated/matmul_i8.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_INTEGER_8)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_i8 (gfc_array_i8 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_i8);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_i8 (gfc_array_i8 * const restrict retarray,
gfc_array_i8 * const restrict a, gfc_array_i8 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_i8 (gfc_array_i8 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_i8 (gfc_array_i8 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_INTEGER_8));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_i8 (gfc_array_i8 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_INTEGER_8 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_INTEGER_8 * restrict bbase_y;
- GFC_INTEGER_8 * restrict dest_y;
- const GFC_INTEGER_8 * restrict abase_n;
- GFC_INTEGER_8 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_INTEGER_8 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_INTEGER_8) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_INTEGER_8)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_INTEGER_8 *a, *b;
+ GFC_INTEGER_8 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_INTEGER_8 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_INTEGER_8)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_i8 (gfc_array_i8 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_i8 (gfc_array_i8 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_r10.c b/libgfortran/generated/matmul_r10.c
index a34856f..3d8c32c 100644
--- a/libgfortran/generated/matmul_r10.c
+++ b/libgfortran/generated/matmul_r10.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_REAL_10)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_r10 (gfc_array_r10 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_r10);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_r10 (gfc_array_r10 * const restrict retarray,
gfc_array_r10 * const restrict a, gfc_array_r10 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_r10 (gfc_array_r10 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_r10 (gfc_array_r10 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_REAL_10));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_r10 (gfc_array_r10 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_REAL_10 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_REAL_10 * restrict bbase_y;
- GFC_REAL_10 * restrict dest_y;
- const GFC_REAL_10 * restrict abase_n;
- GFC_REAL_10 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_REAL_10 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_REAL_10) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_REAL_10)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_REAL_10 *a, *b;
+ GFC_REAL_10 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_REAL_10 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_REAL_10)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_r10 (gfc_array_r10 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_r10 (gfc_array_r10 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_r16.c b/libgfortran/generated/matmul_r16.c
index d2f11bd..e5a0a92 100644
--- a/libgfortran/generated/matmul_r16.c
+++ b/libgfortran/generated/matmul_r16.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_REAL_16)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_r16 (gfc_array_r16 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_r16);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_r16 (gfc_array_r16 * const restrict retarray,
gfc_array_r16 * const restrict a, gfc_array_r16 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_r16 (gfc_array_r16 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_r16 (gfc_array_r16 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_REAL_16));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_r16 (gfc_array_r16 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_REAL_16 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_REAL_16 * restrict bbase_y;
- GFC_REAL_16 * restrict dest_y;
- const GFC_REAL_16 * restrict abase_n;
- GFC_REAL_16 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_REAL_16 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_REAL_16) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_REAL_16)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_REAL_16 *a, *b;
+ GFC_REAL_16 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_REAL_16 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_REAL_16)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_r16 (gfc_array_r16 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_r16 (gfc_array_r16 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_r4.c b/libgfortran/generated/matmul_r4.c
index ff3b93f..6b6ad9b 100644
--- a/libgfortran/generated/matmul_r4.c
+++ b/libgfortran/generated/matmul_r4.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_REAL_4)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_r4 (gfc_array_r4 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_r4);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_r4 (gfc_array_r4 * const restrict retarray,
gfc_array_r4 * const restrict a, gfc_array_r4 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_r4 (gfc_array_r4 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_r4 (gfc_array_r4 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_REAL_4));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_r4 (gfc_array_r4 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_REAL_4 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_REAL_4 * restrict bbase_y;
- GFC_REAL_4 * restrict dest_y;
- const GFC_REAL_4 * restrict abase_n;
- GFC_REAL_4 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_REAL_4 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_REAL_4) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_REAL_4)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_REAL_4 *a, *b;
+ GFC_REAL_4 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_REAL_4 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_REAL_4)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_r4 (gfc_array_r4 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_r4 (gfc_array_r4 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/generated/matmul_r8.c b/libgfortran/generated/matmul_r8.c
index af805ee..f3d0149 100644
--- a/libgfortran/generated/matmul_r8.c
+++ b/libgfortran/generated/matmul_r8.c
@@ -32,7 +32,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
#if defined (HAVE_GFC_REAL_8)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we'll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -75,6 +75,9 @@ extern void matmul_r8 (gfc_array_r8 * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_r8);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_r8 (gfc_array_r8 * const restrict retarray,
gfc_array_r8 * const restrict a, gfc_array_r8 * const restrict b, int try_blas,
@@ -99,7 +102,7 @@ matmul_r8 (gfc_array_r8 * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -127,47 +130,47 @@ matmul_r8 (gfc_array_r8 * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof (GFC_REAL_8));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
if (GFC_DESCRIPTOR_RANK (retarray) == 1)
@@ -230,61 +233,294 @@ matmul_r8 (gfc_array_r8 * const restrict retarray,
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we're performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const GFC_REAL_8 one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const GFC_REAL_8 * restrict bbase_y;
- GFC_REAL_8 * restrict dest_y;
- const GFC_REAL_8 * restrict abase_n;
- GFC_REAL_8 bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const GFC_REAL_8 one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof (GFC_REAL_8) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = (GFC_REAL_8)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const GFC_REAL_8 *a, *b;
+ GFC_REAL_8 *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ GFC_REAL_8 t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = (GFC_REAL_8)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -334,7 +570,9 @@ matmul_r8 (gfc_array_r8 * const restrict retarray,
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -372,5 +610,5 @@ matmul_r8 (gfc_array_r8 * const restrict retarray,
}
}
}
-
+#pragma GCC reset_options
#endif
diff --git a/libgfortran/m4/matmul.m4 b/libgfortran/m4/matmul.m4
index 468615b..20a0404 100644
--- a/libgfortran/m4/matmul.m4
+++ b/libgfortran/m4/matmul.m4
@@ -33,7 +33,7 @@ include(iparm.m4)dnl
`#if defined (HAVE_'rtype_name`)
/* Prototype for the BLAS ?gemm subroutine, a pointer to which can be
- passed to us by the front-end, in which case we''`ll call it for large
+ passed to us by the front-end, in which case we call it for large
matrices. */
typedef void (*blas_call)(const char *, const char *, const int *, const int *,
@@ -76,6 +76,9 @@ extern void matmul_'rtype_code` ('rtype` * const restrict retarray,
int blas_limit, blas_call gemm);
export_proto(matmul_'rtype_code`);
+#pragma GCC optimize ( "-Ofast" )
+#pragma GCC optimize ( "-funroll-loops" )
+
void
matmul_'rtype_code` ('rtype` * const restrict retarray,
'rtype` * const restrict a, 'rtype` * const restrict b, int try_blas,
@@ -100,7 +103,7 @@ matmul_'rtype_code` ('rtype` * const restrict retarray,
o One-dimensional argument B is implicitly treated as a column matrix
dimensioned [count, 1], so ycount=1.
- */
+*/
if (retarray->base_addr == NULL)
{
@@ -128,47 +131,47 @@ matmul_'rtype_code` ('rtype` * const restrict retarray,
= xmallocarray (size0 ((array_t *) retarray), sizeof ('rtype_name`));
retarray->offset = 0;
}
- else if (unlikely (compile_options.bounds_check))
- {
- index_type ret_extent, arg_extent;
-
- if (GFC_DESCRIPTOR_RANK (a) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else if (GFC_DESCRIPTOR_RANK (b) == 1)
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic: is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- else
- {
- arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 1:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
-
- arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
- ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
- if (arg_extent != ret_extent)
- runtime_error ("Incorrect extent in return array in"
- " MATMUL intrinsic for dimension 2:"
- " is %ld, should be %ld",
- (long int) ret_extent, (long int) arg_extent);
- }
- }
+ else if (unlikely (compile_options.bounds_check))
+ {
+ index_type ret_extent, arg_extent;
+
+ if (GFC_DESCRIPTOR_RANK (a) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else if (GFC_DESCRIPTOR_RANK (b) == 1)
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic: is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ else
+ {
+ arg_extent = GFC_DESCRIPTOR_EXTENT(a,0);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,0);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 1:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+
+ arg_extent = GFC_DESCRIPTOR_EXTENT(b,1);
+ ret_extent = GFC_DESCRIPTOR_EXTENT(retarray,1);
+ if (arg_extent != ret_extent)
+ runtime_error ("Incorrect extent in return array in"
+ " MATMUL intrinsic for dimension 2:"
+ " is %ld, should be %ld",
+ (long int) ret_extent, (long int) arg_extent);
+ }
+ }
'
sinclude(`matmul_asm_'rtype_code`.m4')dnl
`
@@ -232,61 +235,294 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl
bbase = b->base_addr;
dest = retarray->base_addr;
-
- /* Now that everything is set up, we''`re performing the multiplication
+ /* Now that everything is set up, we perform the multiplication
itself. */
#define POW3(x) (((float) (x)) * ((float) (x)) * ((float) (x)))
+#define min(a,b) ((a) <= (b) ? (a) : (b))
+#define max(a,b) ((a) >= (b) ? (a) : (b))
if (try_blas && rxstride == 1 && (axstride == 1 || aystride == 1)
&& (bxstride == 1 || bystride == 1)
&& (((float) xcount) * ((float) ycount) * ((float) count)
> POW3(blas_limit)))
- {
- const int m = xcount, n = ycount, k = count, ldc = rystride;
- const 'rtype_name` one = 1, zero = 0;
- const int lda = (axstride == 1) ? aystride : axstride,
- ldb = (bxstride == 1) ? bystride : bxstride;
-
- if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
- {
- assert (gemm != NULL);
- gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m, &n, &k,
- &one, abase, &lda, bbase, &ldb, &zero, dest, &ldc, 1, 1);
- return;
- }
- }
-
- if (rxstride == 1 && axstride == 1 && bxstride == 1)
{
- const 'rtype_name` * restrict bbase_y;
- 'rtype_name` * restrict dest_y;
- const 'rtype_name` * restrict abase_n;
- 'rtype_name` bbase_yn;
+ const int m = xcount, n = ycount, k = count, ldc = rystride;
+ const 'rtype_name` one = 1, zero = 0;
+ const int lda = (axstride == 1) ? aystride : axstride,
+ ldb = (bxstride == 1) ? bystride : bxstride;
- if (rystride == xcount)
- memset (dest, 0, (sizeof ('rtype_name`) * xcount * ycount));
- else
+ if (lda > 0 && ldb > 0 && ldc > 0 && m > 1 && n > 1 && k > 1)
{
- for (y = 0; y < ycount; y++)
- for (x = 0; x < xcount; x++)
- dest[x + y*rystride] = ('rtype_name`)0;
+ assert (gemm != NULL);
+ gemm (axstride == 1 ? "N" : "T", bxstride == 1 ? "N" : "T", &m,
+ &n, &k, &one, abase, &lda, bbase, &ldb, &zero, dest,
+ &ldc, 1, 1);
+ return;
}
+ }
- for (y = 0; y < ycount; y++)
+ if (rxstride == 1 && axstride == 1 && bxstride == 1)
+ {
+ /* This block of code implements a tuned matmul, derived from
+ Superscalar GEMM-based level 3 BLAS, Beta version 0.1
+
+ Bo Kagstrom and Per Ling
+ Department of Computing Science
+ Umea University
+ S-901 87 Umea, Sweden
+
+ from netlib.org, translated to C, and modified for matmul.m4. */
+
+ const 'rtype_name` *a, *b;
+ 'rtype_name` *c;
+ const index_type m = xcount, n = ycount, k = count;
+
+ /* System generated locals */
+ index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
+ i3, i4, i5, i6;
+
+ /* Local variables */
+ 'rtype_name` t1[65536], /* was [256][256] */
+ f11, f12, f21, f22, f31, f32, f41, f42,
+ f13, f14, f23, f24, f33, f34, f43, f44;
+ index_type i, j, l, ii, jj, ll;
+ index_type isec, jsec, lsec, uisec, ujsec, ulsec;
+
+ a = abase;
+ b = bbase;
+ c = retarray->base_addr;
+
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
+ c -= c_offset;
+ a_dim1 = aystride;
+ a_offset = 1 + a_dim1;
+ a -= a_offset;
+ b_dim1 = bystride;
+ b_offset = 1 + b_dim1;
+ b -= b_offset;
+
+ /* Early exit if possible */
+ if (m == 0 || n == 0 || k == 0)
+ return;
+
+ /* Empty c first. */
+ for (j=1; j<=n; j++)
+ for (i=1; i<=m; i++)
+ c[i + j * c_dim1] = ('rtype_name`)0;
+
+ /* Start turning the crank. */
+ i1 = n;
+ for (jj = 1; jj <= i1; jj += 512)
{
- bbase_y = bbase + y*bystride;
- dest_y = dest + y*rystride;
- for (n = 0; n < count; n++)
+ /* Computing MIN */
+ i2 = 512;
+ i3 = n - jj + 1;
+ jsec = min(i2,i3);
+ ujsec = jsec - jsec % 4;
+ i2 = k;
+ for (ll = 1; ll <= i2; ll += 256)
{
- abase_n = abase + n*aystride;
- bbase_yn = bbase_y[n];
- for (x = 0; x < xcount; x++)
+ /* Computing MIN */
+ i3 = 256;
+ i4 = k - ll + 1;
+ lsec = min(i3,i4);
+ ulsec = lsec - lsec % 2;
+
+ i3 = m;
+ for (ii = 1; ii <= i3; ii += 256)
{
- dest_y[x] += abase_n[x] * bbase_yn;
+ /* Computing MIN */
+ i4 = 256;
+ i5 = m - ii + 1;
+ isec = min(i4,i5);
+ uisec = isec - isec % 2;
+ i4 = ll + ulsec - 1;
+ for (l = ll; l <= i4; l += 2)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 2)
+ {
+ t1[l - ll + 1 + ((i - ii + 1) << 8) - 257] =
+ a[i + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 1) << 8) - 257] =
+ a[i + (l + 1) * a_dim1];
+ t1[l - ll + 1 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + l * a_dim1];
+ t1[l - ll + 2 + ((i - ii + 2) << 8) - 257] =
+ a[i + 1 + (l + 1) * a_dim1];
+ }
+ if (uisec < isec)
+ {
+ t1[l - ll + 1 + (isec << 8) - 257] =
+ a[ii + isec - 1 + l * a_dim1];
+ t1[l - ll + 2 + (isec << 8) - 257] =
+ a[ii + isec - 1 + (l + 1) * a_dim1];
+ }
+ }
+ if (ulsec < lsec)
+ {
+ i4 = ii + isec - 1;
+ for (i = ii; i<= i4; ++i)
+ {
+ t1[lsec + ((i - ii + 1) << 8) - 257] =
+ a[i + (ll + lsec - 1) * a_dim1];
+ }
+ }
+
+ uisec = isec - isec % 4;
+ i4 = jj + ujsec - 1;
+ for (j = jj; j <= i4; j += 4)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f22 = c[i + 1 + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f23 = c[i + 1 + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ f24 = c[i + 1 + (j + 3) * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ f32 = c[i + 2 + (j + 1) * c_dim1];
+ f42 = c[i + 3 + (j + 1) * c_dim1];
+ f33 = c[i + 2 + (j + 2) * c_dim1];
+ f43 = c[i + 3 + (j + 2) * c_dim1];
+ f34 = c[i + 2 + (j + 3) * c_dim1];
+ f44 = c[i + 3 + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f22 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f23 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f24 += t1[l - ll + 1 + ((i - ii + 2) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + j * b_dim1];
+ f32 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f42 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 1) * b_dim1];
+ f33 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f43 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 2) * b_dim1];
+ f34 += t1[l - ll + 1 + ((i - ii + 3) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ f44 += t1[l - ll + 1 + ((i - ii + 4) << 8) - 257]
+ * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + 1 + (j + 1) * c_dim1] = f22;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + 1 + (j + 2) * c_dim1] = f23;
+ c[i + (j + 3) * c_dim1] = f14;
+ c[i + 1 + (j + 3) * c_dim1] = f24;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ c[i + 2 + (j + 1) * c_dim1] = f32;
+ c[i + 3 + (j + 1) * c_dim1] = f42;
+ c[i + 2 + (j + 2) * c_dim1] = f33;
+ c[i + 3 + (j + 2) * c_dim1] = f43;
+ c[i + 2 + (j + 3) * c_dim1] = f34;
+ c[i + 3 + (j + 3) * c_dim1] = f44;
+ }
+ if (uisec < isec)
+ {
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ f12 = c[i + (j + 1) * c_dim1];
+ f13 = c[i + (j + 2) * c_dim1];
+ f14 = c[i + (j + 3) * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f12 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 1) * b_dim1];
+ f13 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 2) * b_dim1];
+ f14 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + (j + 3) * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + (j + 1) * c_dim1] = f12;
+ c[i + (j + 2) * c_dim1] = f13;
+ c[i + (j + 3) * c_dim1] = f14;
+ }
+ }
+ }
+ if (ujsec < jsec)
+ {
+ i4 = jj + jsec - 1;
+ for (j = jj + ujsec; j <= i4; ++j)
+ {
+ i5 = ii + uisec - 1;
+ for (i = ii; i <= i5; i += 4)
+ {
+ f11 = c[i + j * c_dim1];
+ f21 = c[i + 1 + j * c_dim1];
+ f31 = c[i + 2 + j * c_dim1];
+ f41 = c[i + 3 + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ f21 += t1[l - ll + 1 + ((i - ii + 2) << 8) -
+ 257] * b[l + j * b_dim1];
+ f31 += t1[l - ll + 1 + ((i - ii + 3) << 8) -
+ 257] * b[l + j * b_dim1];
+ f41 += t1[l - ll + 1 + ((i - ii + 4) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ c[i + 1 + j * c_dim1] = f21;
+ c[i + 2 + j * c_dim1] = f31;
+ c[i + 3 + j * c_dim1] = f41;
+ }
+ i5 = ii + isec - 1;
+ for (i = ii + uisec; i <= i5; ++i)
+ {
+ f11 = c[i + j * c_dim1];
+ i6 = ll + lsec - 1;
+ for (l = ll; l <= i6; ++l)
+ {
+ f11 += t1[l - ll + 1 + ((i - ii + 1) << 8) -
+ 257] * b[l + j * b_dim1];
+ }
+ c[i + j * c_dim1] = f11;
+ }
+ }
+ }
}
}
}
+ return;
}
else if (rxstride == 1 && aystride == 1 && bxstride == 1)
{
@@ -336,7 +572,9 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl
for (n = 0; n < count; n++)
for (x = 0; x < xcount; x++)
/* dest[x,y] += a[x,n] * b[n,y] */
- dest[x*rxstride + y*rystride] += abase[x*axstride + n*aystride] * bbase[n*bxstride + y*bystride];
+ dest[x*rxstride + y*rystride] +=
+ abase[x*axstride + n*aystride] *
+ bbase[n*bxstride + y*bystride];
}
else if (GFC_DESCRIPTOR_RANK (a) == 1)
{
@@ -373,6 +611,6 @@ sinclude(`matmul_asm_'rtype_code`.m4')dnl
}
}
}
-}
-
-#endif'
+}'
+#pragma GCC reset_options
+#endif
[-- Attachment #3: matmul_12.f90 --]
[-- Type: text/x-fortran, Size: 625 bytes --]
! { dg-do run }
program main
integer, parameter :: sz=5, su=3
integer, parameter :: l=2
integer, parameter :: u=l-1+su
integer(kind=4), dimension(sz,sz) :: r,a,b
integer :: i,j
do i=1,4
do j=1,4
a(i,j) = i*10+j
b(i,j) = 100+i*10+j
end do
end do
r = -1
b(l:u,l:u) = reshape([(i,i=1,su*su)],[su,su]);
a(l:u,l:u) = reshape([(i,i=1,su*su)],[su,su]);
r(1:su,1:su) = matmul(a(l:u,l:u),b(l:u,l:u))
if (any(reshape(r,[sz*sz]) /= [30, 36, 42, -1, -1, 66, 81, 96, -1, -1,&
& 102, 126, 150, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])) &
call abort
end program main
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-14 0:09 [patch,libgfortran] PR51119 - MATMUL slow for large matrices Jerry DeLisle
@ 2016-11-14 0:55 ` Steve Kargl
2016-11-14 1:01 ` Jerry DeLisle
2016-11-14 7:04 ` Thomas Koenig
1 sibling, 1 reply; 11+ messages in thread
From: Steve Kargl @ 2016-11-14 0:55 UTC (permalink / raw)
To: Jerry DeLisle; +Cc: fortran, GCC Patches
On Sun, Nov 13, 2016 at 04:08:50PM -0800, Jerry DeLisle wrote:
> Hi all,
>
> Attached patch implements a fast blocked matrix multiply. The basic algorithm is
> derived from netlib.org tuned blas dgemm. See matmul.m4 for reference.
>
> The matmul() function is compiled with -Ofast -funroll-loops. This can be
> customized further if there is an undesired optimization being used. This is
> accomplished using #pragma optimize ( string ).
>
Did you run any tests with '--param max-unroll-times=4' where
the 4 could be something other than 4. On troutmask, with my
code I've found that 4 seems to work the best with -funroll-loops.
--
Steve
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-14 0:55 ` Steve Kargl
@ 2016-11-14 1:01 ` Jerry DeLisle
0 siblings, 0 replies; 11+ messages in thread
From: Jerry DeLisle @ 2016-11-14 1:01 UTC (permalink / raw)
To: Steve Kargl; +Cc: fortran, GCC Patches
On 11/13/2016 04:55 PM, Steve Kargl wrote:
> On Sun, Nov 13, 2016 at 04:08:50PM -0800, Jerry DeLisle wrote:
>> Hi all,
>>
>> Attached patch implements a fast blocked matrix multiply. The basic algorithm is
>> derived from netlib.org tuned blas dgemm. See matmul.m4 for reference.
>>
>> The matmul() function is compiled with -Ofast -funroll-loops. This can be
>> customized further if there is an undesired optimization being used. This is
>> accomplished using #pragma optimize ( string ).
>>
>
> Did you run any tests with '--param max-unroll-times=4' where
> the 4 could be something other than 4. On troutmask, with my
> code I've found that 4 seems to work the best with -funroll-loops.
>
Have not tried this, will give it a try. Also, I have not tested on your FreeBSD
machine yet.
Jerry
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-14 0:09 [patch,libgfortran] PR51119 - MATMUL slow for large matrices Jerry DeLisle
2016-11-14 0:55 ` Steve Kargl
@ 2016-11-14 7:04 ` Thomas Koenig
2016-11-14 22:14 ` Jerry DeLisle
1 sibling, 1 reply; 11+ messages in thread
From: Thomas Koenig @ 2016-11-14 7:04 UTC (permalink / raw)
To: Jerry DeLisle, fortran; +Cc: GCC Patches
Hi Jerry,
I think this
+ /* Parameter adjustments */
+ c_dim1 = m;
+ c_offset = 1 + c_dim1;
should be
+ /* Parameter adjustments */
+ c_dim1 = rystride;
+ c_offset = 1 + c_dim1;
Regarding options for matmul: It is possible to add the
options to the lines in Makefile.in
# Turn on vectorization and loop unrolling for matmul.
$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS +=
-ftree-vectorize -funroll-loops
This is a great step forward. I think we can close most matmul-related
PRs once this patch has been applied.
Regards
Thomas
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-14 7:04 ` Thomas Koenig
@ 2016-11-14 22:14 ` Jerry DeLisle
2016-11-15 7:23 ` Thomas Koenig
2016-11-15 8:21 ` Richard Biener
0 siblings, 2 replies; 11+ messages in thread
From: Jerry DeLisle @ 2016-11-14 22:14 UTC (permalink / raw)
To: Thomas Koenig, fortran; +Cc: GCC Patches
On 11/13/2016 11:03 PM, Thomas Koenig wrote:
> Hi Jerry,
>
> I think this
>
> + /* Parameter adjustments */
> + c_dim1 = m;
> + c_offset = 1 + c_dim1;
>
> should be
>
> + /* Parameter adjustments */
> + c_dim1 = rystride;
> + c_offset = 1 + c_dim1;
>
> Regarding options for matmul: It is possible to add the
> options to the lines in Makefile.in
>
> # Turn on vectorization and loop unrolling for matmul.
> $(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
> -funroll-loops
>
> This is a great step forward. I think we can close most matmul-related
> PRs once this patch has been applied.
>
> Regards
>
> Thomas
>
With Thomas suggestion, I can remove the #pragma optimize from the source code.
Doing this: (long lines wrapped as shown)
diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
index 39d3e11..9ee17f9 100644
--- a/libgfortran/Makefile.am
+++ b/libgfortran/Makefile.am
@@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \
intrinsics/f2c_specifics.F90
# Turn on vectorization and loop unrolling for matmul.
-$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
-funroll-loops
+$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math
-fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param
max-unroll-times=4 -ftree-loop-vectorize
# Logical matmul doesn't vectorize.
$(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops
Comparing gfortran 6 vs 7: (test program posted in PR51119)
$ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
max-unroll-times=4 compare.f90
$ ./a.out
=========================================================
================ MEASURED GIGAFLOPS =
=========================================================
Matmul Matmul
fixed Matmul variable
Size Loops explicit refMatmul assumed explicit
=========================================================
2 2000 11.928 0.047 0.082 0.138
4 2000 1.455 0.220 0.371 0.316
8 2000 1.476 0.737 0.704 1.574
16 2000 4.536 3.755 2.825 3.820
32 2000 6.070 5.443 3.124 5.158
64 2000 5.423 5.355 5.405 5.413
128 2000 5.913 5.841 5.917 5.917
256 477 5.865 5.252 5.863 5.862
512 59 2.794 2.841 2.794 2.791
1024 7 1.662 1.356 1.662 1.661
2048 1 1.753 1.724 1.753 1.754
$ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
max-unroll-times=4 compare.f90
$ ./a.out
=========================================================
================ MEASURED GIGAFLOPS =
=========================================================
Matmul Matmul
fixed Matmul variable
Size Loops explicit refMatmul assumed explicit
=========================================================
2 2000 12.146 0.042 0.090 0.146
4 2000 1.496 0.232 0.384 0.325
8 2000 2.330 0.765 0.763 0.965
16 2000 4.611 4.120 2.792 3.830
32 2000 6.068 5.265 3.102 4.859
64 2000 6.527 5.329 6.425 6.495
128 2000 8.207 5.643 8.336 8.441
256 477 9.210 4.967 9.367 9.299
512 59 8.330 2.772 8.422 8.342
1024 7 8.430 1.378 8.511 8.424
2048 1 8.339 1.718 8.425 8.322
I do think we need to adjust the default inline limit and should do this
separately from this patch.
With these changes, OK for trunk?
Regards,
Jerry
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-14 22:14 ` Jerry DeLisle
@ 2016-11-15 7:23 ` Thomas Koenig
2016-11-15 15:59 ` Jerry DeLisle
2016-11-15 8:21 ` Richard Biener
1 sibling, 1 reply; 11+ messages in thread
From: Thomas Koenig @ 2016-11-15 7:23 UTC (permalink / raw)
To: Jerry DeLisle, fortran; +Cc: GCC Patches
Hi Jerry,
> With these changes, OK for trunk?
Just going over this with a fine comb...
One thing just struck me: The loop variables should be index_type, so
const index_type m = xcount, n = ycount, k = count;
[...]
index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
i3, i4, i5, i6;
/* Local variables */
GFC_REAL_4 t1[65536], /* was [256][256] */
f11, f12, f21, f22, f31, f32, f41, f42,
f13, f14, f23, f24, f33, f34, f43, f44;
index_type i, j, l, ii, jj, ll;
index_type isec, jsec, lsec, uisec, ujsec, ulsec;
I agree that we should do the tuning of the inline limit
separately.
When we do that, we should think about -Os. With the buffering, we have
much more memory usage in the library function. If -Os is in force,
we should also consider raising the limit for inlining.
Since I was involved in the development, I would like to give others a
few days to raise more comments. If there are none, OK to commit with
the above change within a few days. Of course, somebody else might also
OK this patch :-)
Regards
Thomas
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-15 7:23 ` Thomas Koenig
@ 2016-11-15 15:59 ` Jerry DeLisle
2016-11-15 16:37 ` Jerry DeLisle
0 siblings, 1 reply; 11+ messages in thread
From: Jerry DeLisle @ 2016-11-15 15:59 UTC (permalink / raw)
To: Thomas Koenig, fortran; +Cc: GCC Patches
On 11/14/2016 11:22 PM, Thomas Koenig wrote:
> Hi Jerry,
>
>> With these changes, OK for trunk?
>
> Just going over this with a fine comb...
>
> One thing just struck me: The loop variables should be index_type, so
>
> const index_type m = xcount, n = ycount, k = count;
>
> [...]
>
> index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
> i3, i4, i5, i6;
>
> /* Local variables */
> GFC_REAL_4 t1[65536], /* was [256][256] */
> f11, f12, f21, f22, f31, f32, f41, f42,
> f13, f14, f23, f24, f33, f34, f43, f44;
> index_type i, j, l, ii, jj, ll;
> index_type isec, jsec, lsec, uisec, ujsec, ulsec;
>
> I agree that we should do the tuning of the inline limit
> separately.
>
Several of my iterations used index_type. I found using integer gives better
performance. The reason is that they are of type ptr_diff_t which is a 64 bit
integer. I suspect we eliminate one memory fetch for each of these and reduce
the register loading by reducing the number of registers needed, two for one
situation. I will change back and retest.
and Paul commeneted "-ftree-vectorize turns on -ftree-loop-vectorize and
-ftree-slp-vectorize already."
I will remove those to options and keep -ftree-vectorize
I will report back my findings.
Thanks, and a fine tooth comb is a very good thing.
Jerry
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-15 15:59 ` Jerry DeLisle
@ 2016-11-15 16:37 ` Jerry DeLisle
2016-11-15 19:52 ` Janne Blomqvist
0 siblings, 1 reply; 11+ messages in thread
From: Jerry DeLisle @ 2016-11-15 16:37 UTC (permalink / raw)
To: Thomas Koenig, fortran; +Cc: GCC Patches
On 11/15/2016 07:59 AM, Jerry DeLisle wrote:
> On 11/14/2016 11:22 PM, Thomas Koenig wrote:
>> Hi Jerry,
>>
>>> With these changes, OK for trunk?
>>
>> Just going over this with a fine comb...
>>
>> One thing just struck me: The loop variables should be index_type, so
>>
>> const index_type m = xcount, n = ycount, k = count;
>>
>> [...]
>>
>> index_type a_dim1, a_offset, b_dim1, b_offset, c_dim1, c_offset, i1, i2,
>> i3, i4, i5, i6;
>>
>> /* Local variables */
>> GFC_REAL_4 t1[65536], /* was [256][256] */
>> f11, f12, f21, f22, f31, f32, f41, f42,
>> f13, f14, f23, f24, f33, f34, f43, f44;
>> index_type i, j, l, ii, jj, ll;
>> index_type isec, jsec, lsec, uisec, ujsec, ulsec;
>>
>> I agree that we should do the tuning of the inline limit
>> separately.
>>
>
> Several of my iterations used index_type. I found using integer gives better
> performance. The reason is that they are of type ptr_diff_t which is a 64 bit
> integer. I suspect we eliminate one memory fetch for each of these and reduce
> the register loading by reducing the number of registers needed, two for one
> situation. I will change back and retest.
>
> and Paul commeneted "-ftree-vectorize turns on -ftree-loop-vectorize and
> -ftree-slp-vectorize already."
>
> I will remove those to options and keep -ftree-vectorize
>
> I will report back my findings.
>
Changed back to index_type, all OK, must have been some OS stuff running in the
background.
All comments incorporated. Standing by for approval.
Jerry
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-15 16:37 ` Jerry DeLisle
@ 2016-11-15 19:52 ` Janne Blomqvist
[not found] ` <3e423918-009b-c993-7cdb-b84aa23e9948@charter.net>
0 siblings, 1 reply; 11+ messages in thread
From: Janne Blomqvist @ 2016-11-15 19:52 UTC (permalink / raw)
To: Jerry DeLisle; +Cc: Thomas Koenig, fortran, GCC Patches
On Tue, Nov 15, 2016 at 6:37 PM, Jerry DeLisle <jvdelisle@charter.net> wrote:
> All comments incorporated. Standing by for approval.
Looks good, nice job! Ok for trunk.
I was thinking that for strided arrays, it probably is faster to copy
them to dense arrays before doing the matrix multiplication. That
would also enable using an optimized blas (-fexternal-blas) for
strided arrays. But this is of course nothing that blocks this patch,
just something that might be worth looking into in the future.
--
Janne Blomqvist
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [patch,libgfortran] PR51119 - MATMUL slow for large matrices
2016-11-14 22:14 ` Jerry DeLisle
2016-11-15 7:23 ` Thomas Koenig
@ 2016-11-15 8:21 ` Richard Biener
1 sibling, 0 replies; 11+ messages in thread
From: Richard Biener @ 2016-11-15 8:21 UTC (permalink / raw)
To: Jerry DeLisle; +Cc: Thomas Koenig, fortran, GCC Patches
On Mon, Nov 14, 2016 at 11:13 PM, Jerry DeLisle <jvdelisle@charter.net> wrote:
> On 11/13/2016 11:03 PM, Thomas Koenig wrote:
>>
>> Hi Jerry,
>>
>> I think this
>>
>> + /* Parameter adjustments */
>> + c_dim1 = m;
>> + c_offset = 1 + c_dim1;
>>
>> should be
>>
>> + /* Parameter adjustments */
>> + c_dim1 = rystride;
>> + c_offset = 1 + c_dim1;
>>
>> Regarding options for matmul: It is possible to add the
>> options to the lines in Makefile.in
>>
>> # Turn on vectorization and loop unrolling for matmul.
>> $(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS +=
>> -ftree-vectorize
>> -funroll-loops
>>
>> This is a great step forward. I think we can close most matmul-related
>> PRs once this patch has been applied.
>>
>> Regards
>>
>> Thomas
>>
>
> With Thomas suggestion, I can remove the #pragma optimize from the source
> code. Doing this: (long lines wrapped as shown)
>
> diff --git a/libgfortran/Makefile.am b/libgfortran/Makefile.am
> index 39d3e11..9ee17f9 100644
> --- a/libgfortran/Makefile.am
> +++ b/libgfortran/Makefile.am
> @@ -850,7 +850,7 @@ intrinsics/dprod_r8.f90 \
> intrinsics/f2c_specifics.F90
>
> # Turn on vectorization and loop unrolling for matmul.
> -$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ftree-vectorize
> -funroll-loops
> +$(patsubst %.c,%.lo,$(notdir $(i_matmul_c))): AM_CFLAGS += -ffast-math
> -fno-protect-parens -fstack-arrays -ftree-vectorize -funroll-loops --param
> max-unroll-times=4 -ftree-loop-vectorize
-ftree-vectorize turns on -ftree-loop-vectorize and
-ftree-slp-vectorize already.
> # Logical matmul doesn't vectorize.
> $(patsubst %.c,%.lo,$(notdir $(i_matmull_c))): AM_CFLAGS += -funroll-loops
>
>
> Comparing gfortran 6 vs 7: (test program posted in PR51119)
>
> $ gfc6 -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
> max-unroll-times=4 compare.f90
> $ ./a.out
> =========================================================
> ================ MEASURED GIGAFLOPS =
> =========================================================
> Matmul Matmul
> fixed Matmul variable
> Size Loops explicit refMatmul assumed explicit
> =========================================================
> 2 2000 11.928 0.047 0.082 0.138
> 4 2000 1.455 0.220 0.371 0.316
> 8 2000 1.476 0.737 0.704 1.574
> 16 2000 4.536 3.755 2.825 3.820
> 32 2000 6.070 5.443 3.124 5.158
> 64 2000 5.423 5.355 5.405 5.413
> 128 2000 5.913 5.841 5.917 5.917
> 256 477 5.865 5.252 5.863 5.862
> 512 59 2.794 2.841 2.794 2.791
> 1024 7 1.662 1.356 1.662 1.661
> 2048 1 1.753 1.724 1.753 1.754
>
> $ gfc -static -Ofast -finline-matmul-limit=32 -funroll-loops --param
> max-unroll-times=4 compare.f90
> $ ./a.out
> =========================================================
> ================ MEASURED GIGAFLOPS =
> =========================================================
> Matmul Matmul
> fixed Matmul variable
> Size Loops explicit refMatmul assumed explicit
> =========================================================
> 2 2000 12.146 0.042 0.090 0.146
> 4 2000 1.496 0.232 0.384 0.325
> 8 2000 2.330 0.765 0.763 0.965
> 16 2000 4.611 4.120 2.792 3.830
> 32 2000 6.068 5.265 3.102 4.859
> 64 2000 6.527 5.329 6.425 6.495
> 128 2000 8.207 5.643 8.336 8.441
> 256 477 9.210 4.967 9.367 9.299
> 512 59 8.330 2.772 8.422 8.342
> 1024 7 8.430 1.378 8.511 8.424
> 2048 1 8.339 1.718 8.425 8.322
>
> I do think we need to adjust the default inline limit and should do this
> separately from this patch.
>
> With these changes, OK for trunk?
>
> Regards,
>
> Jerry
>
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2016-11-16 21:59 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-14 0:09 [patch,libgfortran] PR51119 - MATMUL slow for large matrices Jerry DeLisle
2016-11-14 0:55 ` Steve Kargl
2016-11-14 1:01 ` Jerry DeLisle
2016-11-14 7:04 ` Thomas Koenig
2016-11-14 22:14 ` Jerry DeLisle
2016-11-15 7:23 ` Thomas Koenig
2016-11-15 15:59 ` Jerry DeLisle
2016-11-15 16:37 ` Jerry DeLisle
2016-11-15 19:52 ` Janne Blomqvist
[not found] ` <3e423918-009b-c993-7cdb-b84aa23e9948@charter.net>
2016-11-16 21:59 ` Jerry DeLisle
2016-11-15 8:21 ` Richard Biener
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).