* Matrix multiplication: performance drop
@ 2009-03-03 16:05 Yury Serdyuk
2009-03-03 16:35 ` John (Eljay) Love-Jensen
2009-03-03 16:45 ` John Fine
0 siblings, 2 replies; 3+ messages in thread
From: Yury Serdyuk @ 2009-03-03 16:05 UTC (permalink / raw)
To: gcc-help
Hi !
I have a simple matrix multiplication code:
#include <stdio.h>
#include <time.h>
#include <malloc.h>
main ( int argc, char *argv[] )
{
int i, j, k;
clock_t t1, t2;
double elapsed_time;
int N = 10;
float *a, *b, *c;
if ( argc > 1 )
N = atoi ( argv [ 1 ] );
a = (float *) malloc ( N * N * sizeof ( float ) );
b = (float *) malloc ( N * N * sizeof ( float ) );
c = (float *) malloc ( N * N * sizeof ( float ) );
for ( i = 0; i < N; i++ )
for ( j = 0; j < N; j++ ) {
a [ i * N + j ] = 0.99;
b [ i * N + j ] = 0.33;
c [ i * N + j ] = 0.0;
}
t1 = clock();
for ( i = 0; i < N; i++ )
for ( j = 0; j < N; j++ )
for ( k = 0; k < N; k++ )
c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];
t2 = clock();
elapsed_time = ( (double) ( t2 - t1 ) ) / CLOCKS_PER_SEC;
printf ( "%f\n", c [ (N - 1) * N + ( N - 1 ) ] );
printf ( "N = %d Elapsed time = %lf\n", N, elapsed_time );
}
I compile it as
> gcc -O3 ...
and then trying it for different N ( Intel Xeon, 3.00 GHz, 8 Gb memory)
Here are the results:
N Time (secs.)
----------------------------------
500 0.25
512 0.86
520 0.29
1000 4.46
1024 12.5
1100 6.48
1500 20.42
1536 30.43
1600 21.04
2000 46.75
2048 446.61 ( !!! )
2100 59.80
So for N multiple of 512 there is very strong drop of performance.
The question is - why and how to avoid it ?
In fact, there is a cache friendly solution, namely, to interchange the
inner loops as:
for ( i = 0; i < N; i++ )
for ( k = 0; k < N; k++ )
for ( j = 0; j < N; j++ )
c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];
but a question remains -
why an unoptimized code works fine, say, for N = 2100 ,
but doesn't work for N = 2048, or, in general, for N multiply of 512?
What is a magic number ?
Thanks,
Yury.
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Matrix multiplication: performance drop
2009-03-03 16:05 Matrix multiplication: performance drop Yury Serdyuk
@ 2009-03-03 16:35 ` John (Eljay) Love-Jensen
2009-03-03 16:45 ` John Fine
1 sibling, 0 replies; 3+ messages in thread
From: John (Eljay) Love-Jensen @ 2009-03-03 16:35 UTC (permalink / raw)
To: Yury Serdyuk, GCC-help
Hi Yury,
> So for N multiple of 512 there is very strong drop of performance.
> The question is - why and how to avoid it ?
I presume the 'why' is because the cache is trashing.
You can avoid it by using a cache friendly calculation.
> In fact, there is a cache friendly solution, namely, to interchange the
> inner loops as:
>
> for ( i = 0; i < N; i++ )
> for ( k = 0; k < N; k++ )
> for ( j = 0; j < N; j++ )
> c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];
That solution worked for me. A x5 improvement in general, and a x40
improvement in the cache thrash multiple of 512 case.
Even more improvement is possible by walking memory in such a way as to keep
the cache as fresh as possible as long as possible. (I think ACM has many
articles on how to walk memory for optimal cache use.)
> but a question remains -
> why an unoptimized code works fine, say, for N = 2100 ,
> but doesn't work for N = 2048, or, in general, for N multiply of 512?
> What is a magic number ?
The unoptimized code did not work fine for me. It was 1/5th the speed of
the improved cache friendly solution, for the general case. (Dual 2.7 GHz
PowerPC G5.)
The take-away item is that compiler optimization -O3 is no substitute for
optimal algorithms. (And optimal algorithms cannot compensate for a bad
architecture... but that's a story for a different day.) And that profiling
is important for identifying the routines that can benefit from being either
hand-optimized, or being massaged to help the optimizer better do its job
(even it it results in somewhat ugly code... please comment heavily for the
poor maintenance programmer).
Sincerely,
--Eljay
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Matrix multiplication: performance drop
2009-03-03 16:05 Matrix multiplication: performance drop Yury Serdyuk
2009-03-03 16:35 ` John (Eljay) Love-Jensen
@ 2009-03-03 16:45 ` John Fine
1 sibling, 0 replies; 3+ messages in thread
From: John Fine @ 2009-03-03 16:45 UTC (permalink / raw)
To: Yury Serdyuk; +Cc: gcc-help
L2 cache is associative. The exact design of that associativity creates
a power of two, such that stepping through memory with a step size of
that power of two (or a nearby power of two) will cause far more cache
misses than a non power of two step size.
Yury Serdyuk wrote:
>
> why an unoptimized code works fine, say, for N = 2100 ,
> but doesn't work for N = 2048, or, in general, for N multiply of 512?
> What is a magic number ?
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2009-03-03 16:45 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-03 16:05 Matrix multiplication: performance drop Yury Serdyuk
2009-03-03 16:35 ` John (Eljay) Love-Jensen
2009-03-03 16:45 ` John Fine
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).