public inbox for gsl-discuss@sourceware.org
* array math.h (SSE2?)
@ 2006-03-06 21:53 James Bergstra
  2006-03-15 18:14 ` Andre Lehovich
  2006-03-25 14:42 ` John D Lamb
  0 siblings, 2 replies; 4+ messages in thread
From: James Bergstra @ 2006-03-06 21:53 UTC (permalink / raw)
  To: gsl-discuss


Does anyone know of a library, or a source file somewhere, in which functions
from math.h such as exp() and log() are implemented using SSE2?  AMD's
libacml_mv demonstrates that such an implementation can be faster than gcc's,
and furthermore that multiple (2) functions can be computed in parallel on a
single processor, giving 2-3 fold speed improvements on array calculations.

Or, taking a step back, is there a project analogous to ATLAS or FFTW that
provides fast array computations of these functions on various architectures?

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj



* Re: array math.h (SSE2?)
  2006-03-06 21:53 array math.h (SSE2?) James Bergstra
@ 2006-03-15 18:14 ` Andre Lehovich
  2006-03-25 14:42 ` John D Lamb
  1 sibling, 0 replies; 4+ messages in thread
From: Andre Lehovich @ 2006-03-15 18:14 UTC (permalink / raw)
  To: James Bergstra, gsl-discuss

--- James Bergstra <james.bergstra@umontreal.ca> wrote:
> Or, taking a step back, is there a project analogous to ATLAS or FFTW that
> provides fast array computations of these functions on various architectures?

The library of optimized inner loops (liboil) looks like it has some
subroutines along those lines.  The focus seems to be signal-processing apps.

http://liboil.freedesktop.org/wiki/

--Andre




* Re: array math.h (SSE2?)
  2006-03-06 21:53 array math.h (SSE2?) James Bergstra
  2006-03-15 18:14 ` Andre Lehovich
@ 2006-03-25 14:42 ` John D Lamb
  2006-03-25 18:34   ` James Bergstra
  1 sibling, 1 reply; 4+ messages in thread
From: John D Lamb @ 2006-03-25 14:42 UTC (permalink / raw)
  To: gsl-discuss

James Bergstra wrote:
> Does anyone know of a library, or a source file somewhere, in which functions
> from math.h such as exp() and log() are implemented using SSE2?  AMD's
> libacml_mv demonstrates that such an implementation can be faster than gcc's,
> and furthermore that multiple (2) functions can be computed in parallel on a
> single processor, giving 2-3 fold speed improvements on array calculations.
> 
> Or, taking a step back, is there a project analogous to ATLAS or FFTW that
> provides fast array computations of these functions on various architectures?
> 

It wouldn't be too hard to write some of these. As an example,
#define SQRT( x, result ) ({ \
      __asm__( "fsqrt\n\t" \
	       : "=t"( result ) \
	       : "0"( x ) ); })
is about 2-3 times as fast as std::sqrt and (I think) gives identical
results for doubles other than std::numeric_limits<double>::infinity()
(which is easily fixed).

However, two issues:
1. GSL is pretty well optimised if you use the appropriate flags for
pentium 4 SSE2: -march=pentium4 -malign-double -mfpmath=sse -msse -msse2
If you compile with these flags and inspect the generated assembly you will
find a big speed improvement and SSE2-specific instructions already present.
I haven't checked the special functions (e.g. cos, exp) but there's clearly
room for gcc to use SSE2 optimisations (though not the fcos op, because it
doesn't give the same answer as gsl_sf_cos).
2. Since gsl allows vector and matrix views, the doubles (assuming
you're using doubles) you might want to add are not necessarily stored
in contiguous memory, which limits the use of movapd and so limits the
value of addpd, subpd and mulpd to speed up calculations.

Of course, there's nothing to stop you writing your own functions that
operate on gsl_blocks and give you much faster arithmetic.

-- 
JDL


* Re: array math.h (SSE2?)
  2006-03-25 14:42 ` John D Lamb
@ 2006-03-25 18:34   ` James Bergstra
  0 siblings, 0 replies; 4+ messages in thread
From: James Bergstra @ 2006-03-25 18:34 UTC (permalink / raw)
  To: John D Lamb; +Cc: gsl-discuss

> James Bergstra wrote:
> > Does anyone know of a library, or a source file somewhere, in which functions
> > from math.h such as exp() and log() are implemented using SSE2?  AMD's
> > libacml_mv demonstrates that such an implementation can be faster than gcc's,
> > and furthermore that multiple (2) functions can be computed in parallel on a
> > single processor, giving 2-3 fold speed improvements on array calculations.
> > 
> > Or, taking a step back, is there a project analogous to ATLAS or FFTW that
> > provides fast array computations of these functions on various architectures?
> > 
On Sat, Mar 25, 2006 at 02:41:56PM +0000, John D Lamb wrote:
> It wouldn't be too hard to write some of these. As an example,
> #define SQRT( x, result ) ({ \
>       __asm__( "fsqrt\n\t" \
> 	       : "=t"( result ) \
> 	       : "0"( x ) ); })
> is about 2-3 times as fast as std::sqrt and (I think) gives identical
> results for doubles other than std::numeric_limits<double>::infinity()
> (which is easily fixed).
Thank you for your suggestion.  The catch I see in extending this to exp()
(which is the function I want most!) is that the x87 instructions handle one
number at a time, and that becomes the bottleneck.
Do you know if anyone has implemented anything like, e.g., the method
described in:

"Evaluation of Elementary Functions using Multimedia Features", Parallel and
Distributed Processing Symposium, 2004?

> However, two issues:
> 1. GSL is pretty well optimised if you use the appropriate flags for
> pentium 4 SSE2: -march=pentium4 -malign-double -mfpmath=sse -msse -msse2
Good point; I'm not sure whether I did this.  Maybe a big oversight!

> 2. Since gsl allows vector and matrix views, the doubles (assuming
> you're using doubles) you might want to add are not necessarily stored
> in contiguous memory, which limits the use of movapd and so limits the
> value of addpd, subpd and mulpd to speed up calculations.
True, but at the same time, the matrix format *does* guarantee contiguous rows,
and I think matrices and vectors are contiguous often enough to warrant
attention. (And I found movntpd the most helpful of all... does GCC ever
generate this instruction?)

> Of course, there's nothing to stop you writing your own functions that
> operate on gsl_blocks and give you much faster arithmetic.
That's basically what I'm doing... although I thought putting gsl_block* in
the function declaration would be inconvenient, so I just use
fn(size_t dim, double *data).

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj

