From: bobk <bobklepko@yahoo.com>
To: gcc-help@gcc.gnu.org
Subject: HELP With Slow SSE Code
Date: Tue, 06 Jun 2006 00:19:00 -0000
Message-ID: <4724748.post@talk.nabble.com>
I am new to the world of SSE, but in trying to speed up some C code I have
run into a wall which is both perplexing and frustrating (since I can't find
a solution). I am hoping someone here can provide the help I seek. I thank
you for all your assistance!
A watered-down version of my code follows. I am running on a Pentium 4-based
machine and compiling with gcc 4.0.2 using the compile options:
-O3 -Wall -march=pentium4 -msse2 -mfpmath=sse
// standard C #include files are put here
#include <emmintrin.h>  // I will actually eventually be using sse2 and
                        // sse instructions
#include <mm_malloc.h>
int main(void)
{
    float *ptr1,*ptr2,*ptr3,*tptr1,*tptr2;
    __m128 m1,m2,m3,*sptr1,*sptr2,*sptr3;
    int i,j,arraysize=1000,loopcount=10;
    // allocate space for dynamic arrays that are aligned to a 16-byte
    // boundary (note that arraysize will actually be read into this program
    // in the final version).
    ptr1=(float *) _mm_malloc(arraysize*sizeof(float),16);
    ptr2=(float *) _mm_malloc(arraysize*sizeof(float),16);
    ptr3=(float *) _mm_malloc(arraysize*sizeof(float),16);
    tptr1=ptr1;
    tptr2=ptr2;
    // fill in two of the arrays with some numbers
    for(i=0;i<arraysize;i++,tptr1++,tptr2++)
    {
        *tptr1=(float)rand();
        *tptr2=(float)rand();
    }
    // TIMING LOOP STARTS
    for(i=0;i<loopcount;i++)
    {
        sptr1=(__m128 *) ptr1; // cast to a pointer to 128-bit values
        sptr2=(__m128 *) ptr2;
        sptr3=(__m128 *) ptr3;
        for(j=0;j<arraysize;j++,sptr1++,sptr2++,sptr3++)
        {
            m1=*sptr1;
            m2=*sptr2;
            m3=_mm_mul_ps(m1,m2); // use SSE intrinsic instruction to
            // multiply two numbers (note that even if I use *sptrx
            // instead of mx I will get the same speed problem).
            *sptr3=m3;
        }
    }
    // TIMING LOOP ENDS HERE
    return 0;
}
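A note on the listing above: each buffer is allocated with room for arraysize floats, which is arraysize/4 values of type __m128, yet the inner loop advances the __m128 pointers arraysize times. As posted, the loop therefore appears to read and write past the end of each allocation. Whether that accounts for the slowdown is an assumption, not something measured here. A minimal sketch of a loop whose bounds match the allocation, using a hypothetical helper name (mul_ps_arrays) that is not part of the original code:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* __m128, _mm_load_ps, _mm_mul_ps, _mm_store_ps */
#include <mm_malloc.h>  /* _mm_malloc, _mm_free */

/* Hypothetical helper: multiply two float arrays element-wise,
   four floats per step.
   Assumptions: n is a multiple of 4, and a, b and out are all
   16-byte aligned (for example, obtained from _mm_malloc(..., 16)). */
void mul_ps_arrays(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    /* Step by 4 floats so exactly n floats (n/4 vectors) are touched. */
    for (size_t j = 0; j < n; j += 4) {
        __m128 x = _mm_load_ps(a + j);   /* aligned 128-bit load  */
        __m128 y = _mm_load_ps(b + j);
        _mm_store_ps(out + j, _mm_mul_ps(x, y)); /* aligned store */
    }
}
```

Called as mul_ps_arrays(ptr1, ptr2, ptr3, arraysize) with arraysize=1000, this performs 250 vector multiplies and stays inside the three buffers. If arraysize really is meant to count __m128 values rather than floats, the alternative would be to allocate arraysize*sizeof(__m128) bytes per buffer instead.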
So my speed problem is as follows. Without the line "*sptr3=m3;" the TIMING
LOOP works as expected: it runs four times faster than the same loop using
normal float values instead of quad-sized float values (i.e. __m128). With the
line "*sptr3=m3;" inside the TIMING LOOP, the code runs about 3 times slower
than when using normal float values. For some reason writing to a pointer
location of type __m128 seems to slow things down, but reading from one is
fine (e.g. the line "m1=*sptr1;"). If I write the computed/multiplied data to
a static array (but I truly need a dynamic array) such as
    x.m[j*i]=m3; // that is, replace line *sptr3=m3 with this line
where, say,
    union {
        __m128 m[1000*10];
        float f[1000*10][4];
    } x;
then the program runs as fast as expected. So what may I be doing wrong
with my code such that I do not effectively take advantage of the SSE
capabilities of the Pentium 4?
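For comparing the scalar and SSE loops on equal footing, a small timing harness along the following lines may help. It is only a sketch: the names mul_scalar, mul_sse, time_kernel and sink are hypothetical, and it assumes n is a multiple of 4 and that every buffer is 16-byte aligned. Both kernels process the same n floats, and the outputs are summed afterwards so the results are actually used. At -O3 a compiler is free to discard a loop whose results are never stored or read, which may be why the variant without the store looks fast; that is an assumption, not something verified against this build.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#include <xmmintrin.h>  /* _mm_load_ps, _mm_mul_ps, _mm_store_ps */
#include <mm_malloc.h>  /* _mm_malloc, _mm_free */

typedef void (*mul_fn)(const float *, const float *, float *, size_t);

/* Plain scalar reference: one float per iteration. */
void mul_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t j = 0; j < n; j++)
        out[j] = a[j] * b[j];
}

/* SSE version: four floats per iteration.
   Assumes n % 4 == 0 and 16-byte aligned pointers. */
void mul_sse(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t j = 0; j < n; j += 4)
        _mm_store_ps(out + j,
                     _mm_mul_ps(_mm_load_ps(a + j), _mm_load_ps(b + j)));
}

/* Run fn `reps` times over the same n floats and return the elapsed
   CPU time in seconds as reported by clock(). The sum of the outputs
   is written to *sink so that the computed values are consumed. */
double time_kernel(mul_fn fn, const float *a, const float *b,
                   float *out, size_t n, int reps, float *sink)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(a, b, out, n);
    clock_t stop = clock();

    float sum = 0.0f;
    for (size_t j = 0; j < n; j++)
        sum += out[j];
    *sink = sum;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_kernel once with mul_scalar and once with mul_sse on the same buffers gives two timings for the same amount of work. With only 1000 floats and 10 repetitions the elapsed time is likely below the resolution of clock(), so a much larger repetition count is probably needed for a meaningful comparison.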
--
View this message in context: http://www.nabble.com/HELP-With-Slow-SSE-Code-t1738578.html#a4724748
Sent from the gcc - Help forum at Nabble.com.