public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed
* Matrix multiplication: performance drop
@ 2009-03-03 16:05 Yury Serdyuk
  2009-03-03 16:35 ` John (Eljay) Love-Jensen
  2009-03-03 16:45 ` John Fine
  0 siblings, 2 replies; 3+ messages in thread
From: Yury Serdyuk @ 2009-03-03 16:05 UTC (permalink / raw)
  To: gcc-help

Hi !

I have a simple matrix multiplication code:

#include <stdio.h>
#include <time.h>
#include <malloc.h>

main ( int argc, char *argv[] )
{
int   i, j, k;
clock_t t1, t2;
double elapsed_time;

int    N = 10;

float *a, *b, *c;

if ( argc > 1 )
 N = atoi ( argv [ 1 ] );

a = (float *) malloc ( N * N * sizeof ( float ) );
b = (float *) malloc ( N * N * sizeof ( float ) );
c = (float *) malloc ( N * N * sizeof ( float ) );

for ( i = 0; i < N; i++ )
 for ( j = 0; j < N; j++ ) {
  a [ i * N + j ] = 0.99;
  b [ i * N + j ] = 0.33;
  c [ i * N + j ] = 0.0;
 }

t1 = clock();

for ( i = 0; i < N; i++ )
 for ( j = 0; j < N; j++ )
  for ( k = 0; k < N; k++ )
   c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];

t2 = clock();

elapsed_time = ( (double) ( t2 - t1 ) ) / CLOCKS_PER_SEC;
printf ( "%f\n", c [ (N - 1) * N + ( N - 1 ) ] );

printf ( "N = %d Elapsed time = %lf\n", N, elapsed_time );

}

I compile it as
 > gcc -O3 ...
and then trying it for different N ( Intel Xeon, 3.00 GHz, 8 Gb memory)
Here are the results:
               N              Time (secs.)
   ----------------------------------
            500                0.25
            512                0.86
            520                0.29
           1000              4.46
           1024              12.5
           1100              6.48
           1500              20.42
           1536              30.43
           1600              21.04
           2000              46.75
           2048              446.61 ( !!! )
           2100               59.80

So for N multiple of 512 there is very strong drop of performance.
The question is - why and how to avoid it ?


In fact, there is a cache friendly solution, namely, to interchange the 
inner loops as:

for ( i = 0; i < N; i++ )
 for ( k = 0; k < N; k++ )
  for ( j = 0; j < N; j++ )
   c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];

but a question remains -
why an unoptimized code works fine, say, for N = 2100 ,
but doesn't work for N = 2048, or, in general, for N multiply of 512?
What is a magic number ?

Thanks,
                 Yury.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Matrix multiplication: performance drop
  2009-03-03 16:05 Matrix multiplication: performance drop Yury Serdyuk
@ 2009-03-03 16:35 ` John (Eljay) Love-Jensen
  2009-03-03 16:45 ` John Fine
  1 sibling, 0 replies; 3+ messages in thread
From: John (Eljay) Love-Jensen @ 2009-03-03 16:35 UTC (permalink / raw)
  To: Yury Serdyuk, GCC-help

Hi Yury,

> So for N multiple of 512 there is very strong drop of performance.
> The question is - why and how to avoid it ?

I presume the 'why' is because the cache is trashing.

You can avoid it by using a cache friendly calculation.

> In fact, there is a cache friendly solution, namely, to interchange the
> inner loops as:
> 
> for ( i = 0; i < N; i++ )
>  for ( k = 0; k < N; k++ )
>   for ( j = 0; j < N; j++ )
>    c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];

That solution worked for me.  A x5 improvement in general, and a x40
improvement in the cache thrash multiple of 512 case.

Even more improvement is possible by walking memory in such a way as to keep
the cache as fresh as possible as long as possible.  (I think ACM has many
articles on how to walk memory for optimal cache use.)

> but a question remains -
> why an unoptimized code works fine, say, for N = 2100 ,
> but doesn't work for N = 2048, or, in general, for N multiply of 512?
> What is a magic number ?

The unoptimized code did not work fine for me.  It was 1/5th the speed of
the improved cache friendly solution, for the general case.  (Dual 2.7 GHz
PowerPC G5.)

The take-away item is that compiler optimization -O3 is no substitute for
optimal algorithms.  (And optimal algorithms cannot compensate for a bad
architecture... but that's a story for a different day.)  And that profiling
is important for identifying the routines that can benefit from being either
hand-optimized, or being massaged to help the optimizer better do its job
(even it it results in somewhat ugly code... please comment heavily for the
poor maintenance programmer).

Sincerely,
--Eljay

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Matrix multiplication: performance drop
  2009-03-03 16:05 Matrix multiplication: performance drop Yury Serdyuk
  2009-03-03 16:35 ` John (Eljay) Love-Jensen
@ 2009-03-03 16:45 ` John Fine
  1 sibling, 0 replies; 3+ messages in thread
From: John Fine @ 2009-03-03 16:45 UTC (permalink / raw)
  To: Yury Serdyuk; +Cc: gcc-help

L2 cache is associative.  The exact design of that associativity creates 
a power of two, such that stepping through memory with a step size of 
that power of two (or a nearby power of two) will cause far more cache 
misses than a non power of two step size.

Yury Serdyuk wrote:
>
> why an unoptimized code works fine, say, for N = 2100 ,
> but doesn't work for N = 2048, or, in general, for N multiply of 512?
> What is a magic number ?

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2009-03-03 16:45 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-03 16:05 Matrix multiplication: performance drop Yury Serdyuk
2009-03-03 16:35 ` John (Eljay) Love-Jensen
2009-03-03 16:45 ` John Fine

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).