public inbox for gcc-help@gcc.gnu.org
* Why worse performance in euclidean distance with SSE2?
@ 2008-04-07 14:09 Dario Bahena Tapia
  2008-04-07 15:23 ` Dario Saccavino
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Dario Bahena Tapia @ 2008-04-07 14:09 UTC (permalink / raw)
  To: gcc-help

Hello,

I have just begun to play with SSE2 and gcc intrinsics. Maybe this
question is not exactly about gcc ... but I think the gcc lists are
a very good place to find help from hardcore assembler hackers ;-1

I have a program which makes heavy use of a Euclidean distance
function. The serial version is:

inline static double dist(int i,int j)
{
  double xd = C[i][X] - C[j][X];
  double yd = C[i][Y] - C[j][Y];
  return rint(sqrt(xd*xd + yd*yd));
}

As you can see, each C[i] is an array of two doubles which represents a
2D vector (indexes 0 and 1 are the X and Y coordinates, respectively). I
tried to vectorize the function using SSE2 and gcc intrinsics; here is
the result:

inline static double dist_sse(int i,int j)
{
  double d;
  __m128d xmm0,xmm1;
  xmm0 = _mm_load_pd(C[i]);
  xmm1 = _mm_load_pd(C[j]);
  xmm0 = _mm_sub_pd(xmm0,xmm1);
  xmm1 = xmm0;
  xmm0 = _mm_mul_pd(xmm0,xmm1);
  xmm1 = _mm_shuffle_pd(xmm0, xmm0, _MM_SHUFFLE2(1, 1));
  xmm0 = _mm_add_pd(xmm0,xmm1);
  xmm0 = _mm_sqrt_pd(xmm0);
  _mm_store_sd(&d,xmm0);
  return rint(d);
}

Of course each C[i] was aligned as SSE2 expects:

for(i=0; i<D; i++)
 C[i] = (double *) _mm_malloc(2 * sizeof(double), 16);

And in order to activate the SSE2 features, I am using the following
flags for gcc (my computer is a laptop):

CFLAGS = -O -Wall -march=pentium-m -msse2

The vectorized version of the function seems to be correct, given that
it provides the same results as its serial counterpart. However, the
performance is poor; the execution time of the program increases by
approximately 50% (for example, when calculating the distance between
every pair of points from a set of 10,000, the serial version takes
around 8 seconds while the vectorized flavor takes 12).

I was expecting a better time given that:

1. The difference of X and Y is done in parallel
2. The product of each difference coordinate with itself is also done
in parallel
3. The sqrt function used is hardware implemented (although the serial
sqrt implementation could also take advantage of hardware)

I suppose the problem here is my lack of experience programming in
assembler in general, and in particular with SSE2. Therefore, I am
looking for advice.

Thank you.

Regards
Dario, the jackal.


* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-07 14:09 Why worse performance in euclidean distance with SSE2? Dario Bahena Tapia
@ 2008-04-07 15:23 ` Dario Saccavino
  2008-04-07 16:09   ` Dario Bahena Tapia
  2008-04-07 16:05 ` jlh
  2008-04-08  8:34 ` Zuxy Meng
  2 siblings, 1 reply; 10+ messages in thread
From: Dario Saccavino @ 2008-04-07 15:23 UTC (permalink / raw)
  To: Dario Bahena Tapia; +Cc: gcc-help

Hello Dario,

I haven't tried your code yet but I think you could get a good boost
if you replace the "sqrt_pd" call with "sqrt_sd", since you only need
the square root of a scalar.
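
A minimal sketch of that change, offered as an illustration rather than
as the code the poster ended up with. The name dist_sse_sd and the
pointer parameters are assumptions made so the fragment is
self-contained (the original indexes a global C[][] array), and the
final rint() rounding is left to the caller. It also replaces the
shuffle with an unpack and uses the scalar add, so only the low lane
goes through the square root:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>     /* uintptr_t, for the alignment check */

/* Hypothetical stand-alone variant of dist_sse(): p and q each point to
   two doubles (x, y) and must be 16-byte aligned for _mm_load_pd.
   Returns the unrounded Euclidean distance. */
static inline double dist_sse_sd(const double *p, const double *q)
{
    assert(((uintptr_t)p & 15) == 0 && ((uintptr_t)q & 15) == 0);

    __m128d d  = _mm_sub_pd(_mm_load_pd(p), _mm_load_pd(q)); /* (xd, yd)       */
    __m128d sq = _mm_mul_pd(d, d);                           /* (xd*xd, yd*yd) */
    __m128d hi = _mm_unpackhi_pd(sq, sq);                    /* (yd*yd, yd*yd) */
    __m128d s  = _mm_add_sd(sq, hi);        /* low lane: xd*xd + yd*yd */

    s = _mm_sqrt_sd(s, s);                  /* square root of the low lane only */
    return _mm_cvtsd_f64(s);                /* extract the low lane as a double */
}
```

Wrapping the call as rint(dist_sse_sd(C[i], C[j])) would give the same
rounding as the original function.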

    Dario



* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-07 14:09 Why worse performance in euclidean distance with SSE2? Dario Bahena Tapia
  2008-04-07 15:23 ` Dario Saccavino
@ 2008-04-07 16:05 ` jlh
  2008-04-07 17:02   ` Dario Bahena Tapia
  2008-04-08  8:34 ` Zuxy Meng
  2 siblings, 1 reply; 10+ messages in thread
From: jlh @ 2008-04-07 16:05 UTC (permalink / raw)
  To: Dario Bahena Tapia, gcc-help

Dario Bahena Tapia wrote:
> inline static double dist(int i,int j)
> {
>   double xd = C[i][X] - C[j][X];
>   double yd = C[i][Y] - C[j][Y];
>   return rint(sqrt(xd*xd + yd*yd));
> }
> [...]
> And in order to activate the SSE2 features, I am using the following
> flags for gcc (my computer is a laptop):
> 
> CFLAGS = -O -Wall -march=pentium-m -msse2

These options do not make dist() use any SSE for me.  Have you
tried compiling with this?

CFLAGS = -O2 -Wall -march=pentium-m -mfpmath=sse

I think -msse2 is redundant if you say -march=pentium-m.  I don't
have an SSE2 machine to try this though.

jlh


* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-07 15:23 ` Dario Saccavino
@ 2008-04-07 16:09   ` Dario Bahena Tapia
  2008-04-07 16:41     ` Dario Bahena Tapia
  0 siblings, 1 reply; 10+ messages in thread
From: Dario Bahena Tapia @ 2008-04-07 16:09 UTC (permalink / raw)
  To: Dario Saccavino; +Cc: gcc-help

Oh you are correct ... that improved it a lot. However, it still runs
slower than the serial version, about 1 second more for the 10,000 data
example.

Thanks.


* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-07 16:09   ` Dario Bahena Tapia
@ 2008-04-07 16:41     ` Dario Bahena Tapia
  0 siblings, 0 replies; 10+ messages in thread
From: Dario Bahena Tapia @ 2008-04-07 16:41 UTC (permalink / raw)
  To: Dario Saccavino; +Cc: gcc-help

Oh you are correct ... that improved it a lot. However, it still runs
slower than the serial version, about 1 second more for the 10,000 data
example.

Thanks.


* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-07 16:05 ` jlh
@ 2008-04-07 17:02   ` Dario Bahena Tapia
  2008-04-07 23:42     ` Brian Budge
  0 siblings, 1 reply; 10+ messages in thread
From: Dario Bahena Tapia @ 2008-04-07 17:02 UTC (permalink / raw)
  To: jlh; +Cc: gcc-help

Hello,

I tried your options but they seem to make no difference. In
another email it was suggested to use _mm_sqrt_sd, because I only
needed one sqrt calculation. That improved the time and, indeed, it
almost reaches the serial version (now it runs up to 1 second slower
for the 10,000 data example, hehe).

But of course, I would want/expect the vector version to run faster
... still unsure how to achieve that.

Thanks


* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-07 17:02   ` Dario Bahena Tapia
@ 2008-04-07 23:42     ` Brian Budge
  2008-04-08  2:15       ` Dario Bahena Tapia
  0 siblings, 1 reply; 10+ messages in thread
From: Brian Budge @ 2008-04-07 23:42 UTC (permalink / raw)
  To: Dario Bahena Tapia; +Cc: jlh, gcc-help

In my experience, SSE is generally more useful when you can optimize
your structures as SOA (struct of array) vs AOS (array of struct).  If
you expect a speed up by doing individual groups of pairs of doubles,
I doubt you'll see much improvement except in extreme situations, or
when the compiler might detect a pattern in your code.  Also, shuffles
etc... are killers.

Much better would be if you had 10000 of these things to take
distances at once, and you could lay out the data friendlier for SSE
(SOA).
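
A rough sketch of what such an SOA layout could look like, under stated
assumptions: the arrays xs/ys, the function name dist_from_point_soa and
the out buffer are hypothetical, not taken from the original program.
With X and Y in separate arrays, each SSE2 instruction works on two
different points at once and no shuffle is needed:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>     /* size_t */

/* Hypothetical SOA layout: xs[k] and ys[k] are the coordinates of point k.
   Writes the unrounded distance from (px, py) to each of the n points
   into out[0..n-1]. Unaligned loads/stores are used, so the arrays need
   no special alignment. */
static void dist_from_point_soa(const double *xs, const double *ys, size_t n,
                                double px, double py, double *out)
{
    assert(xs != NULL && ys != NULL && out != NULL);

    const __m128d vpx = _mm_set1_pd(px);   /* (px, px) */
    const __m128d vpy = _mm_set1_pd(py);   /* (py, py) */
    size_t k = 0;

    /* Main loop: two points per iteration, one point per lane. */
    for (; k + 2 <= n; k += 2) {
        __m128d dx = _mm_sub_pd(_mm_loadu_pd(xs + k), vpx);
        __m128d dy = _mm_sub_pd(_mm_loadu_pd(ys + k), vpy);
        __m128d s  = _mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy));
        _mm_storeu_pd(out + k, _mm_sqrt_pd(s));   /* both lanes are useful here */
    }

    /* Scalar tail for an odd n. */
    for (; k < n; k++) {
        double dx = xs[k] - px, dy = ys[k] - py;
        __m128d s = _mm_set_sd(dx * dx + dy * dy);
        _mm_store_sd(out + k, _mm_sqrt_sd(s, s));
    }
}
```

Here _mm_sqrt_pd does pay off, because both lanes carry a distance that
is actually wanted. Whether this beats the scalar loop on a given
machine still has to be measured.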

  Brian


* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-07 23:42     ` Brian Budge
@ 2008-04-08  2:15       ` Dario Bahena Tapia
  0 siblings, 0 replies; 10+ messages in thread
From: Dario Bahena Tapia @ 2008-04-08  2:15 UTC (permalink / raw)
  To: Brian Budge; +Cc: gcc-help

Hello,

I think I concur; indeed, the original program had a structure of arrays
(each coordinate in a separate array). I will try to use SSE2 on that
flavor, although I think sqrt will still be the bottleneck ... maybe I
could also use another norm function (like maximum or taxicab).
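
For illustration, the two alternative norms mentioned here need no
square root at all. This is only a sketch: the function names and the
pointer-to-pair parameters are assumptions, and both norms give
different values than the Euclidean distance, so they only help if the
application can live with a different metric.

```c
#include <assert.h>
#include <stddef.h>  /* NULL */

/* Absolute value without libm, to keep the sketch dependency-free. */
static inline double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Taxicab (L1) distance between two 2D points stored as {x, y}. */
static inline double dist_taxicab(const double *p, const double *q)
{
    assert(p != NULL && q != NULL);
    return abs_d(p[0] - q[0]) + abs_d(p[1] - q[1]);
}

/* Maximum (Chebyshev, L-infinity) distance between two 2D points. */
static inline double dist_maximum(const double *p, const double *q)
{
    assert(p != NULL && q != NULL);
    double dx = abs_d(p[0] - q[0]);
    double dy = abs_d(p[1] - q[1]);
    return dx > dy ? dx : dy;
}
```

For the points (0, 0) and (3, -4) the taxicab distance is 7 and the
maximum distance is 4, against a Euclidean distance of 5.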

Thanks.



* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-07 14:09 Why worse performance in euclidean distance with SSE2? Dario Bahena Tapia
  2008-04-07 15:23 ` Dario Saccavino
  2008-04-07 16:05 ` jlh
@ 2008-04-08  8:34 ` Zuxy Meng
  2008-04-08 15:57   ` Dario Bahena Tapia
  2 siblings, 1 reply; 10+ messages in thread
From: Zuxy Meng @ 2008-04-08  8:34 UTC (permalink / raw)
  To: gcc-help

Hi,

"Dario Bahena Tapia" <dario.mx@gmail.com> wrote in message news:3d104d6f0804070617u47213cc8nbc697dab9dc262b5@mail.gmail.com...
> Hello,
>
> I have just begun to play with SSE2 and gcc intrinsics. Indeed, maybe
> this question is not exactly about gcc  ... but I think gcc lists are
> a very good place to find help from  hardcore assembler hackers ;-1
>
> I have a program which makes heavy usage of euclidean distance
> function. The serial version is:
>
> inline static double dist(int i,int j)
> {
>  double xd = C[i][X] - C[j][X];
>  double yd = C[i][Y] - C[j][Y];
>  return rint(sqrt(xd*xd + yd*yd));
> }
>
> As you can see each C[i] is an array of two double which represents a
> 2D vector (indexes 0 and 1 are coordinates X,Y respectively). I tried
> to vectorize the function using SSE2 and gcc intrinsics, here is the
> result:
>
> inline static double dist_sse(int i,int j)
> {
>  double d;
>  __m128d xmm0,xmm1;
>  xmm0 =_mm_load_pd(C[i]);
>  xmm1 = _mm_load_pd(C[j]);
>  xmm0 = _mm_sub_pd(xmm0,xmm1);
>  xmm1 = xmm0;
>  xmm0 = _mm_mul_pd(xmm0,xmm1);
>  xmm1 = _mm_shuffle_pd(xmm0, xmm0, _MM_SHUFFLE2(1, 1));
>  xmm0 = _mm_add_pd(xmm0,xmm1);
>  xmm0 = _mm_sqrt_pd(xmm0);
>  _mm_store_sd(&d,xmm0);
>  return rint(d);
> }
>
> Of course each C[i] was aligned as SSE2 expects:
>
> for(i=0; i<D; i++)
> C[i] = (double *) _mm_malloc(2 * sizeof(double), 16);
>
> And in order to activate the SSE2 features, I am using the following
> flags for gcc (my computer is a laptop):
>
> CFLAGS = -O -Wall -march=pentium-m -msse2
>
> The vectorized version of the function seems to be correct, given it
> provides same results as serial counterpart. However, the performance
> is poor; execution time of program increases by approximately 50% (for
> example, in calculating the distance of every pair of points from a
> set of 10,000, the serial version takes around 8 seconds while
> vectorized flavor takes 12).
>
> I was expecting a better time given that:
>
> 1. The difference of X and Y is done in parallel
> 2. The product of each difference coordinate with itself is also done
> in parallel
> 3. The sqrt function used is hardware implemented (although serial
> sqrt implementation could also take advantage of hardware)
>
> I suppose the problem here is my lack of experience programming in
> assembler in general, and in particular with SSE2. Therefore, I am
> looking for advice.

1. First of all, you didn't extract the parallelism in your algorithm. SSE2
won't help you if all you want is to pick two points at random indices
and calculate the distance. However, it will help you a lot when you
calculate the distances between a given point and 1 million others whose
indices are sequential.

2. Unroll the loop to hide the latency of square root as much as possible.

3. Since the final result is an integer, you may consider using "float" 
instead of "double". That'll give you a performance boost even without SSE2. 
And rsqrtps comes in handy too, if its precision is acceptable.
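
The three points above can be sketched together as follows. This is an
assumption-laden illustration, not the poster's code: the SOA arrays
xs/ys, the function name dist_from_point_f32 and the requirement that n
be a multiple of 8 are all made up to keep the example short. It uses
"float" (four points per register) and is unrolled by two so that two
independent square roots are in flight per iteration:

```c
#include <assert.h>
#include <stddef.h>     /* size_t */
#include <xmmintrin.h>  /* SSE: __m128, _mm_sqrt_ps */

/* Hypothetical layout: xs[k], ys[k] are the float coordinates of point k,
   with sequential indices. Writes the unrounded distance from (px, py)
   to each point into out[]. For brevity n must be a multiple of 8. */
static void dist_from_point_f32(const float *xs, const float *ys, size_t n,
                                float px, float py, float *out)
{
    assert(n % 8 == 0);

    const __m128 vpx = _mm_set1_ps(px);
    const __m128 vpy = _mm_set1_ps(py);

    for (size_t k = 0; k < n; k += 8) {
        /* Two independent groups of four points each. */
        __m128 dx0 = _mm_sub_ps(_mm_loadu_ps(xs + k),     vpx);
        __m128 dy0 = _mm_sub_ps(_mm_loadu_ps(ys + k),     vpy);
        __m128 dx1 = _mm_sub_ps(_mm_loadu_ps(xs + k + 4), vpx);
        __m128 dy1 = _mm_sub_ps(_mm_loadu_ps(ys + k + 4), vpy);

        __m128 s0 = _mm_add_ps(_mm_mul_ps(dx0, dx0), _mm_mul_ps(dy0, dy0));
        __m128 s1 = _mm_add_ps(_mm_mul_ps(dx1, dx1), _mm_mul_ps(dy1, dy1));

        /* The two square roots do not depend on each other, which gives
           the CPU a chance to overlap their latency. */
        _mm_storeu_ps(out + k,     _mm_sqrt_ps(s0));
        _mm_storeu_ps(out + k + 4, _mm_sqrt_ps(s1));
    }
}
```

If reduced precision is acceptable, _mm_sqrt_ps(s) could be replaced by
_mm_mul_ps(s, _mm_rsqrt_ps(s)), which uses the rsqrtps approximation.
That variant yields NaN when s is exactly zero (0 times infinity), so a
point coinciding with (px, py) would need separate handling, and the
result would have to be checked against the rounding the program
expects.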

-- 
Zuxy 



* Re: Why worse performance in euclidean distance with SSE2?
  2008-04-08  8:34 ` Zuxy Meng
@ 2008-04-08 15:57   ` Dario Bahena Tapia
  0 siblings, 0 replies; 10+ messages in thread
From: Dario Bahena Tapia @ 2008-04-08 15:57 UTC (permalink / raw)
  To: Zuxy Meng; +Cc: gcc-help

Hello,

Yeah, others have also suggested changing the way I process them in
order to allow for that. Working on it ;-|

Will consider the other suggestions as well !!!

Thanks.



Thread overview: 10+ messages
2008-04-07 14:09 Why worse performance in euclidean distance with SSE2? Dario Bahena Tapia
2008-04-07 15:23 ` Dario Saccavino
2008-04-07 16:09   ` Dario Bahena Tapia
2008-04-07 16:41     ` Dario Bahena Tapia
2008-04-07 16:05 ` jlh
2008-04-07 17:02   ` Dario Bahena Tapia
2008-04-07 23:42     ` Brian Budge
2008-04-08  2:15       ` Dario Bahena Tapia
2008-04-08  8:34 ` Zuxy Meng
2008-04-08 15:57   ` Dario Bahena Tapia
