* prefetching on pentium 4
@ 2006-11-28 10:17 ranjith kumar
  2006-11-28 13:45 ` Tim Prince
  2006-11-28 15:25 ` Ian Lance Taylor
  0 siblings, 2 replies; 5+ messages in thread
From: ranjith kumar @ 2006-11-28 10:17 UTC
To: gcc-help

Hi,

1) Will gcc insert prefetch instructions automatically for a Pentium 4 processor? Which flags should be enabled at compile time so that gcc inserts prefetch instructions automatically?

2) Or does the programmer have to call some function explicitly? If so, what is the syntax of that function?

Thanks in advance.

* Re: prefetching on pentium 4
From: Tim Prince @ 2006-11-28 13:45 UTC
To: ranjith kumar; +Cc: gcc-help

ranjith kumar wrote:
> Hi,
>
> 1) Will gcc insert prefetch instructions automatically for a Pentium 4 processor? Which flags should be enabled at compile time so that gcc inserts prefetch instructions automatically?
>
> 2) Or does the programmer have to call some function explicitly? If so, what is the syntax of that function?

P4 isn't suitable for automatic compiler-generated prefetch. Default hardware prefetch (stride-based and cache line pairs) is quite effective. Prefetch intrinsics are available with #include <xmmintrin.h>.

Details on what works vary with steppings. On the earliest P4 models, a program could accelerate hardware prefetch by issuing prefetches for 3 cache lines before entering a loop. Since Northwood, that doesn't work. Since Prescott, prefetch hints are ignored on P4, with prefetch going to L2 regardless of hints. The effect of prefetch on DTLB misses is also model dependent.

Contrary to what certain Windows-related docs say, _mm_prefetch() works the same on all compilers which implement it.
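
For reference, the intrinsic mentioned above can be called as sketched below. This is only a minimal illustration, assuming an x86 target with SSE available (on 32-bit x86 a flag such as -msse or -march=pentium4 may be needed). The names `sum_with_prefetch` and `PREFETCH_DISTANCE` are hypothetical, and the distance of 16 elements is an untuned guess. On a plain sequential walk like this one the hardware prefetcher will likely do the job anyway, so read it as a syntax example rather than an optimization.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting that data a few iterations ahead will be needed. */
long sum_with_prefetch(const long *data, size_t n)
{
    assert(data != NULL || n == 0);

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Only issue the hint while the target element is inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE], _MM_HINT_T0);
        total += data[i];
    }
    return total;
}
```

The second argument must be a compile-time constant. In GCC's xmmintrin.h, _MM_HINT_T0 is 3 and _MM_HINT_NTA is 0; _MM_HINT_T1 and _MM_HINT_T2 select the other temporal hints.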

* Re: prefetching on pentium 4
From: ranjith kumar @ 2006-12-05 18:54 UTC
To: tprince, gcc-help

--- Tim Prince <timothyprince@sbcglobal.net> wrote:

> ranjith kumar wrote:
> > Hi,
> >
> > 1) Will gcc insert prefetch instructions automatically for a Pentium 4 processor? Which flags should be enabled at compile time so that gcc inserts prefetch instructions automatically?
> >
> > 2) Or does the programmer have to call some function explicitly? If so, what is the syntax of that function?
>
> P4 isn't suitable for automatic compiler-generated prefetch. Default hardware prefetch (stride-based and cache line pairs) is quite effective. Prefetch intrinsics are available with #include <xmmintrin.h>.
>
> Details on what works vary with steppings. On the earliest P4 models, a program could accelerate hardware prefetch by issuing prefetches for 3 cache lines before entering a loop. Since Northwood, that doesn't work. Since Prescott, prefetch hints are ignored on P4, with prefetch going to L2 regardless of hints. The effect of prefetch on DTLB misses is also model dependent.
>
> Contrary to what certain Windows-related docs say, _mm_prefetch() works the same on all compilers which implement it.

Hi,

1) What is the difference between the "prefetchnta" and "prefetchT1" instructions on a Pentium 4 processor? The IA-32 optimization manual says that prefetchT0, prefetchT1 and prefetchT2 are identical on the Pentium 4. It also says that prefetchnta fetches the data into the second-level cache while "minimizing cache pollution". What does "minimizing cache pollution" mean?

When I compared two programs, the first prefetching data with "prefetchnta" and the second with "prefetchT0", I observed that the second program ran faster. What could be the reason?

P.S. Below is the program that uses "prefetchT0". To get the program that uses "prefetchnta", pass 0 as the second argument to the function on line 22. I ran them on a Pentium 4 processor under Fedora Core.

     1  #include<stdio.h>
     2  #include<xmmintrin.h>
     3  int main()
     4  {
     5
     6      int i,j,k,h;
     7      struct list
     8      {
     9          long double w,w1,u,u1,x,x1,y,y1,z,z1;
    10          long double e1,e2,e3,e4;
    11          long double b1,b2,b3,b4,b5,b6;
    12          long int a,b,c,d,e;
    13      }l[5000];
    14
    15
    16      int total = 0;
    17      for(h=0;h<9;h++)
    18          for(j=0;j<99999;j++)
    19              for(i=0;i<1000;i++)
    20              {
    21                  //k=rand()%500;
    22                  _mm_prefetch((&l[(i+2)].a),3);
    23
    24                  total+=(l[i*5].a)*(l[i*5].b)*(l[i*5].c)*(l[i*5].d)*(l[i*5].e);
    25
    26                  //printf("\n %d ",total);
    27              }
    28
    29      printf("\n %d ",total);
    30  }

* Re: prefetching on pentium 4
From: Tim Prince @ 2006-12-05 20:34 UTC
To: ranjith kumar; +Cc: tprince, gcc-help

ranjith kumar wrote:
> --- Tim Prince <timothyprince@sbcglobal.net> wrote:
>
>> ranjith kumar wrote:
>>> Hi,
>>>
>>> 1) Will gcc insert prefetch instructions automatically for a Pentium 4 processor? Which flags should be enabled at compile time so that gcc inserts prefetch instructions automatically?
>>>
>>> 2) Or does the programmer have to call some function explicitly? If so, what is the syntax of that function?
>>
>> P4 isn't suitable for automatic compiler-generated prefetch. Default hardware prefetch (stride-based and cache line pairs) is quite effective. Prefetch intrinsics are available with #include <xmmintrin.h>.
>>
>> Details on what works vary with steppings. On the earliest P4 models, a program could accelerate hardware prefetch by issuing prefetches for 3 cache lines before entering a loop. Since Northwood, that doesn't work. Since Prescott, prefetch hints are ignored on P4, with prefetch going to L2 regardless of hints. The effect of prefetch on DTLB misses is also model dependent.
>>
>> Contrary to what certain Windows-related docs say, _mm_prefetch() works the same on all compilers which implement it.
>
> Hi,
>
> 1) What is the difference between the "prefetchnta" and "prefetchT1" instructions on a Pentium 4 processor? The IA-32 optimization manual says that prefetchT0, prefetchT1 and prefetchT2 are identical on the Pentium 4. It also says that prefetchnta fetches the data into the second-level cache while "minimizing cache pollution". What does "minimizing cache pollution" mean?

Schemes for fetching directly to L1 have generally been abandoned in favor of waiting until the hardware requests data. L1 isn't big enough to handle extra data brought in "speculatively" with enough advance notice to handle L2 misses.

> When I compared two programs, the first prefetching data with "prefetchnta" and the second with "prefetchT0", I observed that the second program ran faster. What could be the reason?

I don't have the experience to comment on that, and it may well depend on which type of P4 you have. Maybe your data are resident in L2 and you have a P4 model which benefits from prefetching them into L1.

You don't even say which compiler you are using or why you wouldn't try vectorization if you were serious about "real-world" performance. Integer multiply on P4 (at least the early ones) is so slow it's hard to imagine much value for other optimizations. You might do better on the caching side by reorganizing your data than by using prefetch.

> P.S. Below is the program that uses "prefetchT0". To get the program that uses "prefetchnta", pass 0 as the second argument to the function on line 22. I ran them on a Pentium 4 processor under Fedora Core.
>
>      1  #include<stdio.h>
>      2  #include<xmmintrin.h>
>      3  int main()
>      4  {
>      5
>      6      int i,j,k,h;
>      7      struct list
>      8      {
>      9          long double w,w1,u,u1,x,x1,y,y1,z,z1;
>     10          long double e1,e2,e3,e4;
>     11          long double b1,b2,b3,b4,b5,b6;
>     12          long int a,b,c,d,e;
>     13      }l[5000];
>     14
>     15
>     16      int total = 0;
>     17      for(h=0;h<9;h++)
>     18          for(j=0;j<99999;j++)
>     19              for(i=0;i<1000;i++)
>     20              {
>     21                  //k=rand()%500;
>     22                  _mm_prefetch((&l[(i+2)].a),3);
>     23
>     24                  total+=(l[i*5].a)*(l[i*5].b)*(l[i*5].c)*(l[i*5].d)*(l[i*5].e);
>     25
>     26                  //printf("\n %d ",total);
>     27              }
>     28
>     29      printf("\n %d ",total);
>     30  }
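
The suggestion to reorganize the data rather than prefetch could look roughly like the sketch below. It is an illustration under assumptions, not something posted in the thread: the five integer fields read by the inner loop are moved into their own densely packed array, so consecutive iterations share cache lines instead of striding across the twenty long double members. The names `hot_fields` and `sum_products` are hypothetical, and the unused long double fields would live in a separate "cold" array that is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Only the fields the inner loop actually reads (names follow the posted struct). */
struct hot_fields {
    long a, b, c, d, e;
};

/* Accumulate the per-element products over a densely packed array. */
long sum_products(const struct hot_fields *hot, size_t n)
{
    assert(hot != NULL || n == 0);

    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += hot[i].a * hot[i].b * hot[i].c * hot[i].d * hot[i].e;
    return total;
}
```

With this layout the access pattern is a small, unit-stride stream, which is the case the hardware prefetcher described earlier in the thread handles well. Whether it beats the explicit prefetch on a given P4 model would have to be measured.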

* Re: prefetching on pentium 4
From: Ian Lance Taylor @ 2006-11-28 15:25 UTC
To: ranjith kumar; +Cc: gcc-help

ranjith kumar <ranjit_kumar_b4u@yahoo.co.uk> writes:

> 1) Will "gcc" insert prefetch instructions automatically on "pentium 4" processor? Which flags should be enabled while compiling so that gcc automatically insert prefetch instructions?

gcc will insert prefetch instructions if you use the -fprefetch-loop-arrays option.

> 2) Or programmer has to include some functions? If so, what is the syntax of that function?

You can use the __builtin_prefetch function. See the documentation.

Ian
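
As a rough sketch of the builtin mentioned above: __builtin_prefetch takes an address plus two optional compile-time constants, a read/write flag (0 for read, 1 for write) and a locality level from 0 (no temporal locality) to 3 (keep in all cache levels). The example assumes GCC or a compatible compiler. The function name `sum_indexed` and the `LOOKAHEAD` distance are hypothetical, and the indirect access pattern was chosen only because a stride-based hardware prefetcher cannot predict it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, untuned prefetch distance in loop iterations. */
#define LOOKAHEAD 8

/* Sum data[idx[i]] for i in [0, n), prefetching the element needed LOOKAHEAD steps later. */
long sum_indexed(const long *data, const size_t *idx, size_t n)
{
    assert((data != NULL && idx != NULL) || n == 0);

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + LOOKAHEAD < n)
            __builtin_prefetch(&data[idx[i + LOOKAHEAD]], 0, 3); /* read, high locality */
        total += data[idx[i]];
    }
    return total;
}
```

For the automatic route, the option goes on the command line together with an optimization level, for example `gcc -O2 -fprefetch-loop-arrays file.c`; whether it helps on a particular Pentium 4 model is something to benchmark rather than assume.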