public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed
* Why worse performace in euclidean distance with SSE2?
@ 2008-04-07 14:09 Dario Bahena Tapia
  2008-04-07 15:23 ` Dario Saccavino
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Dario Bahena Tapia @ 2008-04-07 14:09 UTC (permalink / raw)
  To: gcc-help

Hello,

I have just begun to play with SSE2 and gcc intrinsics. Indeed, maybe
this question is not exactly about gcc  ... but I think gcc lists are
a very good place to find help from  hardcore assembler hackers ;-1

I have a program which makes heavy usage of euclidean distance
function. The serial version is:

inline static double dist(int i,int j)
{
  double xd = C[i][X] - C[j][X];
  double yd = C[i][Y] - C[j][Y];
  return rint(sqrt(xd*xd + yd*yd));
}

As you can see each C[i] is an array of two double which represents a
2D vector (indexes 0 and 1 are coordinates X,Y respectively). I tried
to vectorize the function using SSE2 and gcc intrinsics, here is the
result:

inline static double dist_sse(int i,int j)
{
  double d;
  __m128d xmm0,xmm1;
  xmm0 =_mm_load_pd(C[i]);
  xmm1 = _mm_load_pd(C[j]);
  xmm0 = _mm_sub_pd(xmm0,xmm1);
  xmm1 = xmm0;
  xmm0 = _mm_mul_pd(xmm0,xmm1);
  xmm1 = _mm_shuffle_pd(xmm0, xmm0, _MM_SHUFFLE2(1, 1));
  xmm0 = _mm_add_pd(xmm0,xmm1);
  xmm0 = _mm_sqrt_pd(xmm0);
  _mm_store_sd(&d,xmm0);
  return rint(d);
}

Of course each C[i] was aligned as SSE2 expects:

for(i=0; i<D; i++)
 C[i] = (double *) _mm_malloc(2 * sizeof(double), 16);

And in order to activate the SSE2 features, I am using the following
flags for gcc (my computer is a laptop):

CFLAGS = -O -Wall -march=pentium-m -msse2

The vectorized version of the function seems to be correct, given it
provides same results as serial counterpart. However, the performace
is poor; execution time of program increases in approximately 50% (for
example, in calculating the distance of every pair of points from a
set of 10,000, the serial version takes around 8 seconds while
vectorized flavor takes 12).

I was expecting a better time given that:

1. The difference of X and Y is done in parallel
2. The product of each difference coordinate with itself is also done
in parallel
3. The sqrt function used is hardware implemented (although serial
sqrt implementation could also take advantage of hardware)

I suppose the problem here is my lack of experience programming in
assembler in general, and in particular with SSE2. Therefore, I am
looking for advice.

Thank you.

Regards
Dario, the jackal.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2008-04-08 14:56 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-07 14:09 Why worse performace in euclidean distance with SSE2? Dario Bahena Tapia
2008-04-07 15:23 ` Dario Saccavino
2008-04-07 16:09   ` Dario Bahena Tapia
2008-04-07 16:41     ` Dario Bahena Tapia
2008-04-07 16:05 ` jlh
2008-04-07 17:02   ` Dario Bahena Tapia
2008-04-07 23:42     ` Brian Budge
2008-04-08  2:15       ` Dario Bahena Tapia
2008-04-08  8:34 ` Zuxy Meng
2008-04-08 15:57   ` Dario Bahena Tapia

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).