public inbox for gcc-help@gcc.gnu.org
From: "Zuxy Meng" <zuxy.meng@gmail.com>
To: gcc-help@gcc.gnu.org
Subject: Re: Why worse performace in euclidean distance with SSE2?
Date: Tue, 08 Apr 2008 08:34:00 -0000
Message-ID: <ftf5j3$g1j$1@ger.gmane.org>
In-Reply-To: <3d104d6f0804070617u47213cc8nbc697dab9dc262b5@mail.gmail.com>

Hi,

"Dario Bahena Tapia" <dario.mx@gmail.com> wrote in message news:3d104d6f0804070617u47213cc8nbc697dab9dc262b5@mail.gmail.com...
> Hello,
>
> I have just begun to play with SSE2 and gcc intrinsics. Indeed, maybe
> this question is not exactly about gcc ... but I think gcc lists are
> a very good place to find help from hardcore assembler hackers ;-)
>
> I have a program which makes heavy use of a euclidean distance
> function. The serial version is:
>
> inline static double dist(int i,int j)
> {
>  double xd = C[i][X] - C[j][X];
>  double yd = C[i][Y] - C[j][Y];
>  return rint(sqrt(xd*xd + yd*yd));
> }
>
> As you can see, each C[i] is an array of two doubles which represents a
> 2D vector (indexes 0 and 1 are the X and Y coordinates respectively).
> I tried to vectorize the function using SSE2 and gcc intrinsics; here
> is the result:
>
> inline static double dist_sse(int i,int j)
> {
>  double d;
>  __m128d xmm0,xmm1;
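>  xmm0 = _mm_load_pd(C[i]);      /* low = X, high = Y of point i */
>  xmm1 = _mm_load_pd(C[j]);      /* low = X, high = Y of point j */
>  xmm0 = _mm_sub_pd(xmm0,xmm1);  /* low = xd, high = yd */
>  xmm1 = xmm0;
>  xmm0 = _mm_mul_pd(xmm0,xmm1);  /* low = xd*xd, high = yd*yd */
>  xmm1 = _mm_shuffle_pd(xmm0, xmm0, _MM_SHUFFLE2(1, 1)); /* both = yd*yd */
>  xmm0 = _mm_add_pd(xmm0,xmm1);  /* low = xd*xd + yd*yd */
>  xmm0 = _mm_sqrt_pd(xmm0);      /* sqrt of both lanes; only low is used */
>  _mm_store_sd(&d,xmm0);         /* copy the low lane into d */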
>  return rint(d);
> }
>
> Of course each C[i] was aligned as SSE2 expects:
>
> for(i=0; i<D; i++)
> C[i] = (double *) _mm_malloc(2 * sizeof(double), 16);
>
> And in order to activate the SSE2 features, I am using the following
> flags for gcc (my computer is a laptop):
>
> CFLAGS = -O -Wall -march=pentium-m -msse2
>
> The vectorized version of the function seems to be correct, given that
> it provides the same results as its serial counterpart. However, the
> performance is poor; the execution time of the program increases by
> approximately 50% (for example, in calculating the distance of every
> pair of points from a set of 10,000, the serial version takes around
> 8 seconds while the vectorized flavor takes 12).
>
> I was expecting a better time given that:
>
> 1. The difference of X and Y is done in parallel
> 2. The product of each difference coordinate with itself is also done
> in parallel
> 3. The sqrt function used is hardware implemented (although serial
> sqrt implementation could also take advantage of hardware)
>
> I suppose the problem here is my lack of experience programming in
> assembler in general, and in particular with SSE2. Therefore, I am
> looking for advice.

1. First of all, you didn't extract the parallelism in your algorithm. SSE2 
won't help you if all you want is to pick two points at random indices 
and calculate the distance between them. However, it will help you a lot 
when you calculate the distances between a given point and 1 million others 
whose indices are sequential.

2. Unroll the loop to hide the latency of square root as much as possible.

3. Since the final result is an integer, you may consider using "float" 
instead of "double". That'll give you a performance boost even without SSE2. 
And rsqrtps comes in handy too, if its precision is acceptable.

-- 
Zuxy 

