public inbox for gcc-help@gcc.gnu.org
* cache optimization
@ 2009-11-26 13:14 Łukasz
  2009-11-26 15:38 ` Tim Prince
  2009-11-26 15:38 ` Harald Servat
  0 siblings, 2 replies; 6+ messages in thread
From: Łukasz @ 2009-11-26 13:14 UTC (permalink / raw)
  To: gcc-help

Hi, I want to learn how to optimize cache usage with GCC. I found the built-in function __builtin_prefetch, which should prefetch data into the cache, so I used the canonical :) example of vector addition.

for (i = 0; i < n; i++)
  {
    a[i] = a[i] + b[i];
    __builtin_prefetch (&a[i+1], 1, 1);
    __builtin_prefetch (&b[i+1], 0, 1);
    /* ... */
  }

and compiled it with gcc without any special options ... but it's slower than

for (i = 0; i < n; i++)
  {
    a[i] = a[i] + b[i];
    /* ... */
  }

So maybe I should compile it with some extra options to take advantage of cache prefetching? (-fprefetch-loop-arrays doesn't work.)




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: cache optimization
  2009-11-26 13:14 cache optimization Łukasz
@ 2009-11-26 15:38 ` Tim Prince
  2009-11-28 18:19   ` Łukasz
  2009-11-26 15:38 ` Harald Servat
  1 sibling, 1 reply; 6+ messages in thread
From: Tim Prince @ 2009-11-26 15:38 UTC (permalink / raw)
  To: Łukasz; +Cc: gcc-help

Łukasz wrote:
> Hi, I want to learn how to optimize cache usage with GCC. I found the
> built-in function __builtin_prefetch, which should prefetch data into
> the cache, so I used the canonical :) example of vector addition.
> 
> for (i = 0; i < n; i++)
>   {
>     a[i] = a[i] + b[i];
>     __builtin_prefetch (&a[i+1], 1, 1);
>     __builtin_prefetch (&b[i+1], 0, 1);
>     /* ... */
>   }
> 
> and compiled it with gcc without any special options ... but it's
> slower than
> 
> for (i = 0; i < n; i++)
>   {
>     a[i] = a[i] + b[i];
>     /* ... */
>   }
> 
> So maybe I should compile it with some extra options to take advantage
> of cache prefetching? (-fprefetch-loop-arrays doesn't work.)
> 
> 
> 
Under normal settings, on CPUs of the last 6 years or so, you are 
prefetching what has already been prefetched by the hardware prefetcher. 
If your search engine doesn't find you many success stories about the 
use of this feature, that might be a clue that it involves some serious 
investigation. You would look for slow spots in your code which don't 
fall in the usual hardware-supported prefetch patterns (linear access 
with not too large a stride, or pairs of cache lines), and experiment 
with fetching the data sufficiently far in advance for it to do some 
good, without exceeding your cache capacity.
I do see a "success story" about prefetching for a reversed loop. As the 
author doesn't divulge the CPU in use, one suspects it might be 
something like the old Athlon32, which supported hardware prefetch only 
in the forward direction. Don't you like advice which assumes no one 
will ever use a CPU different from (e.g. more up to date than) the 
author's favorite?
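[Editor's sketch] Tim's suggestion (find access patterns the hardware prefetcher misses, then fetch far enough ahead without blowing the cache) might look like the following for a large-stride walk. The function name `scale_strided` and the distance `PF_DIST` are illustrative assumptions, not code from the thread:

```c
#include <stddef.h>

/* PF_DIST is a tuning knob, not a value from the thread: how many
   iterations ahead to prefetch.  Start small and measure. */
#define PF_DIST 8

void scale_strided(double *a, size_t n, size_t stride)
{
    for (size_t i = 0; i < n; i++) {
        /* A large stride may defeat the hardware prefetcher, so fetch
           the element we will touch PF_DIST iterations from now.  The
           bounds check keeps the prefetch address inside the array. */
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[(i + PF_DIST) * stride], 1, 1);
        a[i * stride] *= 2.0;
    }
}
```

Whether this wins over plain code depends entirely on the CPU and the stride; the only way to pick PF_DIST is to benchmark.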


* Re: cache optimization
  2009-11-26 13:14 cache optimization Łukasz
  2009-11-26 15:38 ` Tim Prince
@ 2009-11-26 15:38 ` Harald Servat
  2009-11-28 18:33   ` Łukasz
  1 sibling, 1 reply; 6+ messages in thread
From: Harald Servat @ 2009-11-26 15:38 UTC (permalink / raw)
  To: Łukasz; +Cc: gcc-help



Although there are other factors that may apply, you must consider that
prefetching only the next element may not be enough for current
microprocessors. First, I'm pretty sure the processor will already have
the entire cache line once you bring in some data. Second, nowadays
microprocessors have hardware prefetching that does the same work for
you without the need to add extra instructions (remember, adding
instructions makes your code slower). Also consider that the
microprocessor has out-of-order circuitry that may allow it to be
working on "future" iterations while waiting on a cache miss.

So I think you may need to prefetch data that is farther away. With the
current gap between processor and memory speeds, I wouldn't be
surprised if you had to go 7 or 8 cache lines ahead.

Just my .02 euro cents :)
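[Editor's sketch] Applied to the original vector-add loop, "prefetching farther" could look like this, assuming 64-byte cache lines and the ballpark of 8 lines ahead given above; `add_vectors` and the constants are hypothetical:

```c
#include <stddef.h>

/* Assumptions, not thread facts: 64-byte cache lines (8 doubles),
   and a distance of 8 lines ahead as suggested above. */
enum { DOUBLES_PER_LINE = 8, AHEAD = 8 * DOUBLES_PER_LINE };

void add_vectors(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* One prefetch per cache line is enough; issuing one per
           element only adds instructions.  (This heuristic assumes
           the arrays are roughly line-aligned.) */
        if (i % DOUBLES_PER_LINE == 0 && i + AHEAD < n) {
            __builtin_prefetch(&a[i + AHEAD], 1, 1);  /* written back */
            __builtin_prefetch(&b[i + AHEAD], 0, 1);  /* read only */
        }
        a[i] += b[i];
    }
}
```

On hardware whose prefetcher already tracks two linear streams, this may still be no faster than the plain loop; it mainly shows the distance and per-line issue pattern.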

Łukasz wrote:
> Hi, I want to learn how to optimize cache usage with GCC. I found the
> built-in function __builtin_prefetch, which should prefetch data into
> the cache, so I used the canonical :) example of vector addition.
> 
> for (i = 0; i < n; i++)
>   {
>     a[i] = a[i] + b[i];
>     __builtin_prefetch (&a[i+1], 1, 1);
>     __builtin_prefetch (&b[i+1], 0, 1);
>     /* ... */
>   }
> 
> and compiled it with gcc without any special options ... but it's
> slower than
> 
> for (i = 0; i < n; i++)
>   {
>     a[i] = a[i] + b[i];
>     /* ... */
>   }
> 
> So maybe I should compile it with some extra options to take advantage
> of cache prefetching? (-fprefetch-loop-arrays doesn't work.)
> 
> 
> 
> 




* Re: cache optimization
  2009-11-26 15:38 ` Tim Prince
@ 2009-11-28 18:19   ` Łukasz
  2009-11-28 20:52     ` Tim Prince
  0 siblings, 1 reply; 6+ messages in thread
From: Łukasz @ 2009-11-28 18:19 UTC (permalink / raw)
  To: gcc-help



--- On Thu, 11/26/09, Tim Prince <n8tm@aol.com> wrote:

> From: Tim Prince <n8tm@aol.com>
> Subject: Re: cache optimization
> To: "Łukasz" <blurrpp@yahoo.com>
> Cc: gcc-help@gcc.gnu.org
> Date: Thursday, November 26, 2009, 4:38 PM
>
> Under normal settings, on CPUs of the last 6 years or so, you are
> prefetching what has already been prefetched by the hardware
> prefetcher. If your search engine doesn't find you many success
> stories about the use of this feature, that might be a clue that it
> involves some serious investigation. You would look for slow spots in
> your code which don't fall in the usual hardware-supported prefetch
> patterns (linear access with not too large a stride, or pairs of
> cache lines), and experiment with fetching the data sufficiently far
> in advance for it to do some good, without exceeding your cache
> capacity.
> I do see a "success story" about prefetching for a reversed loop. As
> the author doesn't divulge the CPU in use, one suspects it might be
> something like the old Athlon32, which supported hardware prefetch
> only in the forward direction. Don't you like advice which assumes no
> one will ever use a CPU different from (e.g. more up to date than)
> the author's favorite?

You are completely right; in this example the compiler turns the code into branch assembly, which of course is already "predicted" (forward NOT TAKEN, backward TAKEN). But I'm looking for a nice example for modern processors that would really work, I mean actually speed up a program (I'm searching the net right now). The Intel Optimization Reference Manual advises using PREFETCH for any predictable memory access pattern, but as you already mentioned, some patterns the processor can predict by itself.




* Re: cache optimization
  2009-11-26 15:38 ` Harald Servat
@ 2009-11-28 18:33   ` Łukasz
  0 siblings, 0 replies; 6+ messages in thread
From: Łukasz @ 2009-11-28 18:33 UTC (permalink / raw)
  To: gcc-help

"(remember, adding instructions make your code slower)"

I would agree with you totaly two years ago, before i started writing GPU programs. There, programs which copy special part of data to SHARE memory, and has about 20% more instructions because of it, works 10 times faster than those without. Thanks for answer and if you have some nice example of speeding up CPU programs with using prefatching, or some good literature please give link :).

CentCollector :)

--- On Thu, 11/26/09, Harald Servat <harald.servat@bsc.es> wrote:

> From: Harald Servat <harald.servat@bsc.es>
> Subject: Re: cache optimization
> To: "Łukasz" <blurrpp@yahoo.com>
> Cc: gcc-help@gcc.gnu.org
> Date: Thursday, November 26, 2009, 4:38 PM
>
> Although there are other factors that may apply, you must consider
> that prefetching only the next element may not be enough for current
> microprocessors. First, I'm pretty sure the processor will already
> have the entire cache line once you bring in some data. Second,
> nowadays microprocessors have hardware prefetching that does the same
> work for you without the need to add extra instructions (remember,
> adding instructions makes your code slower). Also consider that the
> microprocessor has out-of-order circuitry that may allow it to be
> working on "future" iterations while waiting on a cache miss.
>
> So I think you may need to prefetch data that is farther away. With
> the current gap between processor and memory speeds, I wouldn't be
> surprised if you had to go 7 or 8 cache lines ahead.
>
> Just my .02 euro cents :)




* Re: cache optimization
  2009-11-28 18:19   ` Łukasz
@ 2009-11-28 20:52     ` Tim Prince
  0 siblings, 0 replies; 6+ messages in thread
From: Tim Prince @ 2009-11-28 20:52 UTC (permalink / raw)
  To: Łukasz; +Cc: gcc-help

Łukasz wrote:
> You are completely right; in this example the compiler turns the code
> into branch assembly, which of course is already "predicted" (forward
> NOT TAKEN, backward TAKEN). But I'm looking for a nice example for
> modern processors that would really work, I mean actually speed up a
> program (I'm searching the net right now). The Intel Optimization
> Reference Manual advises using PREFETCH for any predictable memory
> access pattern, but as you already mentioned, some patterns the
> processor can predict by itself.
Prefetch intrinsics might be effective in a case such as

for(i=0; i<n; ++i)
    a[indx[i]] += b[indx[i]];

but, as another responder said, you must recognize that you will be 
fetching whole cache lines, or pairs of cache lines, and may need to 
fetch half a dozen in advance.  If you are running on an Atom (which is 
not out-of-order), you might expect entirely different results, and a 
different optimum strategy, from a CPU with a large out-of-order queue.
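[Editor's sketch] A compilable version of that indirect-access case: the wrapper `gather_add` and the distance `PF_AHEAD` (the "half a dozen in advance" above) are hypothetical starting points to be tuned by measurement, not code from the thread:

```c
#include <stddef.h>

/* Hypothetical distance: ~half a dozen iterations ahead, per the
   advice above.  Tune by benchmarking on the target CPU. */
#define PF_AHEAD 6

void gather_add(double *a, const double *b, const size_t *indx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* indx[] itself streams linearly, so the hardware prefetcher
           covers it; a[indx[i]] and b[indx[i]] are the irregular
           accesses worth prefetching by hand. */
        if (i + PF_AHEAD < n) {
            __builtin_prefetch(&a[indx[i + PF_AHEAD]], 1, 1);
            __builtin_prefetch(&b[indx[i + PF_AHEAD]], 0, 1);
        }
        a[indx[i]] += b[indx[i]];
    }
}
```

The prefetch reads indx[i + PF_AHEAD] early, which is cheap because that element is already streaming in; the payoff, if any, comes from hiding the latency of the scattered a[] and b[] loads.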


end of thread, other threads:[~2009-11-28 20:52 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-26 13:14 cache optimization Łukasz
2009-11-26 15:38 ` Tim Prince
2009-11-28 18:19   ` Łukasz
2009-11-28 20:52     ` Tim Prince
2009-11-26 15:38 ` Harald Servat
2009-11-28 18:33   ` Łukasz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).