public inbox for gsl-discuss@sourceware.org
* gslclapack
@ 2006-03-26  2:41 James Bergstra
  2006-03-28 11:48 ` gslclapack Brian Gough
  0 siblings, 1 reply; 24+ messages in thread
From: James Bergstra @ 2006-03-26  2:41 UTC (permalink / raw)
  To: gsl-discuss


Please excuse a sort of continued email (further to my last comment on Cholesky
decomposition).  I notice that lapack implements the Cholesky decomposition of real
symmetric matrices in a file called dpotrf.f.  In that file, there is apparently
a 'blocked' version of the decomposition algorithm that uses level-3 BLAS
routines instead of level-2 ones.

On one hand, I'm jealous that lapack has a fancy routine, and react "let's
implement that for gsl too" and on the other hand, I think "why is gsl
re-implementing (what I believe is) a very standard library?"

What would be the pros and cons of slowly migrating the number-crunching from
gsl_linalg routines to gsl_clapack ones, while leaving argument-checking to the
gsl_linalg ones?  I think this division works well in the case of gsl_blas vs.
gsl_cblas.

My assumption through all this is that clapack offers more routines and faster
implementations than those of the GSL.  The main advantage I hope for is that
the lapack implementation is separate from the gsl, and can be chosen at
application link-time.
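
For concreteness, the kind of raw call such a gsl_clapack layer would wrap
looks roughly like this (just a sketch: it assumes a Fortran LAPACK linked
in with -llapack, and it ignores the hidden string-length argument that some
Fortran ABIs pass for the character parameter):

    #include <stdio.h>

    /* assumed prototype of the Fortran routine */
    extern void dpotrf_(const char *uplo, const int *n,
                        double *a, const int *lda, int *info);

    int main(void)
    {
      /* 3x3 symmetric positive-definite matrix; since it is symmetric,
         the row-major/column-major question does not arise here */
      double a[9] = { 4, 2, 2,
                      2, 5, 3,
                      2, 3, 6 };
      int n = 3, lda = 3, info;

      dpotrf_("L", &n, a, &lda, &info);   /* blocked Cholesky, A = L L^T */

      printf("info = %d, L(1,1) = %g\n", info, a[0]);   /* expect 0 and 2 */
      return 0;
    }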

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-26  2:41 gslclapack James Bergstra
@ 2006-03-28 11:48 ` Brian Gough
  2006-03-28 12:14   ` gslclapack C J Kenneth Tan -- OptimaNumerics
  2006-03-29  0:33   ` gslclapack Gerard Jungman
  0 siblings, 2 replies; 24+ messages in thread
From: Brian Gough @ 2006-03-28 11:48 UTC (permalink / raw)
  To: James Bergstra; +Cc: gsl-discuss

James Bergstra writes:
 > On one hand, I'm jealous that lapack has a fancy routine, and react "let's
 > implement that for gsl too" and on the other hand, I think "why is gsl
 > re-implementing (what I believe is) a very standard library?"

LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
a dependency on lapack would be a pain.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-28 11:48 ` gslclapack Brian Gough
@ 2006-03-28 12:14   ` C J Kenneth Tan -- OptimaNumerics
  2006-03-28 13:43     ` gslclapack Robert G. Brown
  2006-03-29  0:33   ` gslclapack Gerard Jungman
  1 sibling, 1 reply; 24+ messages in thread
From: C J Kenneth Tan -- OptimaNumerics @ 2006-03-28 12:14 UTC (permalink / raw)
  To: Brian Gough; +Cc: James Bergstra, gsl-discuss

Brian,

On 2006-03-28 12:48 +0100 Brian Gough (bjg@network-theory.co.uk) wrote:

> Date: Tue, 28 Mar 2006 12:48:30 +0100
> From: Brian Gough <bjg@network-theory.co.uk>
> To: James Bergstra <james.bergstra@umontreal.ca>
> Cc: gsl-discuss <gsl-discuss@sources.redhat.com>
> Subject: Re: gslclapack
> 
> James Bergstra writes:
>  > On one hand, I'm jealous that lapack has a fancy routine, and react "let's
>  > implement that for gsl too" and on the other hand, I think "why is gsl
>  > re-implementing (what I believe is) a very standard library?"
> 
> LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
> a dependency on lapack would be a pain.

How do you find LAPACK being a pain?  The public version of LAPACK has
been well tested and the design has been well thought out.  It has a
very comprehensive test suite also. 

Having LAPACK in the backend would also enable GSL users to take
advantage of high-performance implementations of LAPACK.  We are able
to deliver substantially better LAPACK performance in the OptimaNumerics
Libraries (even when compared to commercial products).  Some benchmark
results are included here:
http://www.OptimaNumerics.com/docs/hpc-asia05/hpc-asia05.pdf .

What's the price of performance?  This question can be rephrased as:
what's the price of electricity, and what's the price of server room
space?


Kenneth Tan
-----------------------------------------------------------------------
C J Kenneth Tan, PhD
OptimaNumerics Ltd                    Telephone: +44 798 941 7838
E-mail: cjtan@OptimaNumerics.com      Telephone: +44 207 099 4428
Web: http://www.OptimaNumerics.com    Facsimile: +44 207 100 4572
-----------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-28 12:14   ` gslclapack C J Kenneth Tan -- OptimaNumerics
@ 2006-03-28 13:43     ` Robert G. Brown
  2006-03-28 23:38       ` gslclapack C J Kenneth Tan -- OptimaNumerics
  2006-03-29  1:57       ` gslclapack James Bergstra
  0 siblings, 2 replies; 24+ messages in thread
From: Robert G. Brown @ 2006-03-28 13:43 UTC (permalink / raw)
  To: C J Kenneth Tan -- OptimaNumerics
  Cc: Brian Gough, James Bergstra, gsl-discuss

On Tue, 28 Mar 2006, C J Kenneth Tan -- OptimaNumerics wrote:

> Brian,
>
> On 2006-03-28 12:48 +0100 Brian Gough (bjg@network-theory.co.uk) wrote:
>
>> Date: Tue, 28 Mar 2006 12:48:30 +0100
>> From: Brian Gough <bjg@network-theory.co.uk>
>> To: James Bergstra <james.bergstra@umontreal.ca>
>> Cc: gsl-discuss <gsl-discuss@sources.redhat.com>
>> Subject: Re: gslclapack
>>
>> James Bergstra writes:
>> > On one hand, I'm jealous that lapack has a fancy routine, and react "let's
>> > implement that for gsl too" and on the other hand, I think "why is gsl
>> > re-implementing (what I believe is) a very standard library?"
>>
>> LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
>> a dependency on lapack would be a pain.
>
> How do you find LAPACK being a pain?  The public version of LAPACK has
> been well tested and the design has been well thought out.  It has a
> very comprehensive test suite also.

Is this an either/or proposition?  Having its own lapack implementation
just eliminates a dependency which I agree is desirable, and since the
system-provided LAPACK often sucks as far as efficiency goes it also
means that one CAN gradually start to tune up gsllapack. If the calls
are the same, though, presumably one could link into an alternative
library via suitable compile/link flags.  I thought this was the way
things were being done already, actually, so one could e.g. use ATLAS
BLAS:

  Linking with an alternative BLAS library

  The following command line shows how you would link the same application
  with an alternative CBLAS library called `libcblas',

  $ gcc example.o -lgsl -lcblas -lm

  For the best performance an optimized platform-specific CBLAS library
  should be used for -lcblas. The library must conform to the CBLAS
  standard. The ATLAS package provides a portable high-performance BLAS
  library with a CBLAS interface. It is free software and should be
  installed for any work requiring fast vector and matrix operations. The
  following command line will link with the ATLAS library and its CBLAS
  interface,

  $ gcc example.o -lgsl -lcblas -latlas -lm

So I don't see why there is an issue here.  LAPACK efficiency was, I
thought, "largely" inherited from BLAS although I'm sure there are
additional efficiencies one can realize if one works for them.  However
as long as the API is the same, one could presumably do -lgsllapack vs
-llapack as a compile/link time choice.  This is moderately more complex
at compile time, but it is easy to document and is a methodology that any
good programmer should be familiar with anyway.
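
Schematically, with gsllapack as a purely hypothetical library name (by
analogy with gslcblas), and ignoring whatever Fortran runtime libraries a
real lapack might drag in:

  $ gcc example.o -lgsl -lgsllapack -lcblas -lm    # bundled reference version
  $ gcc example.o -lgsl -llapack -lcblas -lm       # external/tuned version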

> What's the price of performance?  This question can be phrased as
> what's the price of electricity and what's the price of server room
> space.

This is dead right, of course.  NOT having the ability to use optimized
libraries can cost you as much as a factor of 2-3 in the case of ATLAS
tuned BLAS, which for a linear algebra application can double the
productivity/dollar of hundreds of thousands of dollars of cluster
hardware (as Ken's sales talk so nicely shows:-).  One doesn't want
there to be a disincentive to the use of GSL in HPC so it can continue
to take over the world:-)

However I don't see this is an either/or proposition, and looking at:

> http://www.OptimaNumerics.com/docs/hpc-asia05/hpc-asia05.pdf .

... the important question isn't benchmarks against "competitor", it
is benchmarks against the specific case of GSL+ATLAS tuned BLAS and the
marginal advantages there, using the same compiler and support library
otherwise.  I'm more than curious to see what the marginal advantages
are of tuned LAPACK >>beyond<< those provided by using a good ATLAS
tuned BLAS and corresponding LAPACK.

Or if you like, a longer term question that is quite worth bringing up
is to what extent it is desirable to introduce architecture specific
tuning into actual GSL code, period.

On the one hand, it makes the code ugly and extremely difficult to
maintain and offers small marginal performance gains in most places BUT
linear algebra.  It tends to depend heavily on compiler as well -- if
you write code optimized for e.g.  presumed SSE1 or SSE2 instructions
your compiler has to support them, not just your CPU.  It makes the code
in a sense less portable.  It is "expensive" in human time and developer
time in the GSL is probably largely donated and a precious resource.
I'd rather give up 20-30% in "potential" performance and use the GSL for
free, personally, blessing the developers and their contribution to
world civilization as I do so. (Thanks, guys!:-)

On the other hand, for some applications tuning CAN be a big, big
performance booster -- 2-3x -- and hence very valuable.  I'm working
(depending on whether and how much support I get) on a portable/open way
of writing at least moderately autotuning HPC code.  Something that will
get "most" of the benefit of a full ATLAS-like autotuning build without
architecture-specific instrumentation.  Don't hold your breath, but in a
year or three maybe this problem can be at least attacked without
playing the library juggling game.

In the meantime, hey, the GSL is open source, full GPL, viral.  That
means that as long as the LAPACK API is preserved -- or a wrapper of the
standard lapack calls provided -- you can ALWAYS hack and link your own
LAPACK in.  The effort in any event is WAY less than the effort required
to actually build an architecture-dependent LAPACK in the first place,
and you can freely market a "tuned" version of the GSL as long as you
provide its sources, and it seems to me that in the case of lapack this
is at MOST developing a wrapper.  I've done similar things already for
adding e.g. rng's to the GSL -- it isn't difficult.  Or am I missing
something?

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-28 13:43     ` gslclapack Robert G. Brown
@ 2006-03-28 23:38       ` C J Kenneth Tan -- OptimaNumerics
  2006-03-29  1:20         ` gslclapack Robert G. Brown
  2006-03-29  1:57       ` gslclapack James Bergstra
  1 sibling, 1 reply; 24+ messages in thread
From: C J Kenneth Tan -- OptimaNumerics @ 2006-03-28 23:38 UTC (permalink / raw)
  To: Robert G. Brown; +Cc: Brian Gough, James Bergstra, gsl-discuss

Robert,

On 2006-03-28 08:44 -0500 Robert G. Brown (rgb@phy.duke.edu) wrote:

> If the calls are the same, though, presumably one could link into an
> alternative library via suitable compile/link flags.

Yes, the API needs to be kept the same.

> So I don't see why there is an issue here.  LAPACK efficiency was, I
> thought, "largely" inherited from BLAS although I'm sure there are
> additional efficiencies one can realize if one works for them.  

Depends on the definition of "largely".  In our LAPACK benchmarks, we
use the same BLAS for the benchmarks, therefore showing the
differences in the LAPACK layer.  We found the differences could be
very high (as much as thousands of percent).

> However I don't see this is an either/or proposition, and looking at:
> 
> > http://www.OptimaNumerics.com/docs/hpc-asia05/hpc-asia05.pdf .
> 
> ... the important question isn't benchmarks against "competitor", it
> is benchmarks against the specific case of GSL+ATLAS tuned BLAS and the
> marginal advantages there, using the same compiler and support library
> otherwise.  I'm more than curious to see what the marginal advantages
> are of tuned LAPACK >>beyond<< those provided by using a good ATLAS
> tuned BLAS and corresponding LAPACK.

As I have mentioned above, the BLAS layer is kept the same in each of the
benchmarks, therefore the benchmarks show the differences in LAPACK
performance. 

> Or if you like, a longer term question that is quite worth bringing up
> is to what extent it is desireable to introduce architecture specific
> tuning into actual GSL code, period.

Is this considered scalable from a development perspective?

> On the one hand, it makes the code ugly and extremely difficult to
> maintain and offers small marginal performance gains in most places BUT
> linear algebra.  It tends to depend heavily on compiler as well -- if
> you write code optimized for e.g.  presumed SSE1 or SSE2 instructions
> your compiler has to support then, not just your CPU.  It makes the code
> in a sense less portable.  It is "expensive" in human time and developer
> time in the GSL is probably largely donated and a precious resource.
> I'd rather give up 20-30% in "potential" performance and use the GSL for
> free, personally, blessing the developers and their contribution to
> world civilization as I do so. (Thanks, guys!:-)
> 
> On the other hand, for some applications tuning CAN be a big, big
> performance booster -- 2-3x -- and hence very valuable.  I'm working
> (depending on whether and how much support I get) on a portable/open way
> of writing at least moderately autotuning HPC code.  Something that will
> get "most" of the benefit of a full ATLAS-like autotuning build without
> architecture-specific instrumentation.  Don't hold your breath, but in a
> year or three maybe this problem can be at least attacked without
> playing the library juggling game.

Is this approach scalable when it comes to highly complex scientific
code, as opposed to the comparatively more straightforward BLAS-level
code? 

> In the meantime, hey, the GSL is open source, full GPL, viral.  That
> means that as long as the LAPACK API is preserved -- or a wrapper of the
> standard lapack calls provided -- you can ALWAYS hack and link your own
> LAPACK in.  The effort in any event is WAY less than the effort required
> to actually build an architecture-dependent LAPACK in the first place,
> and you can freely market a "tuned" version of the GSL as long as you
> provide its sources, and it seems to me that in the case of lapack this
> is as MOST developing a wrapper.  I've done similar things already for
> adding e.g. rng's to the GSL -- it isn't difficult.  Or am I missing
> something?

Yes, we agree on this.  This is what I think is a good option.  It is
crucial that the LAPACK API is preserved.


Kenneth Tan
-----------------------------------------------------------------------
C J Kenneth Tan, PhD
OptimaNumerics Ltd                    Telephone: +44 798 941 7838
E-mail: cjtan@OptimaNumerics.com      Telephone: +44 207 099 4428
Web: http://www.OptimaNumerics.com    Facsimile: +44 207 100 4572
-----------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-28 11:48 ` gslclapack Brian Gough
  2006-03-28 12:14   ` gslclapack C J Kenneth Tan -- OptimaNumerics
@ 2006-03-29  0:33   ` Gerard Jungman
  2006-03-29 17:45     ` gslclapack Linas Vepstas
  2006-03-31 23:19     ` gslclapack Brian Gough
  1 sibling, 2 replies; 24+ messages in thread
From: Gerard Jungman @ 2006-03-29  0:33 UTC (permalink / raw)
  To: gsl-discuss

On Tue, 2006-03-28 at 12:48 +0100, Brian Gough wrote:
> James Bergstra writes:
>  > On one hand, I'm jealous that lapack has a fancy routine, and react "let's
>  > implement that for gsl too" and on the other hand, I think "why is gsl
>  > re-implementing (what I believe is) a very standard library?"
> 
> LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
> a dependency on lapack would be a pain.

This is too cheap an answer. It also smells too much like the
old "XYZ package is too large and complicated, we just need
to build our own" syndrome.

And there is absolutely no reason why "the right way"
should be in conflict with "easy to use". For example,
I think the way GSL does the cblas interface, with a
link-time resolution, is very easy to use, and it
seems like a pretty good approximation to the right way.
(Well, it was my idea, so of course I think so...).


There have been several comments on the original message, so I'll
try not to labor over the same points. Here is what I think is obvious:

  - gsl_linalg is basically a "mistake"; it satisfies the immediate
    need of a certain class of users, but in so doing it hardwires 
    an unattractive and inflexible scheme in place for everyone; even
    if you don't use it, you pay for it because it costs development
    time and takes up intellectual real estate. And I wrote a lot of
    it, so I think my opinion counts for something in this regard.
    It was my mistake.

  - GSL should provide an interface to lapack; it need not supersede
    the gsl_linalg implementation, but can coexist with it. It can
    be done in such a way that users can choose an appropriate
    implementation at link time, in the same way that the blas
    interface works. The only issue here is that we need to work
    out how to satisfy its blas dependencies. As long as it
    works in terms of cblas conformant calls, the dependencies
    will be resolved correctly at link time when the desired
    cblas implementation is picked up. If it works some other
    way, like deferring to an underlying fortran lapack   
    implementation which uses fortran-style name lookup
    for blas functions, then that would make things
    less coherent and this would need to be fixed up
    in some way. I am hoping that the clapack standard
    says something about this situation; after all, it
    ought to be designed to work well with cblas.

  - Lapack may be difficult to build. I wouldn't know, since I
    never had to build it. I just install the RPM. As someone
    pointed out, the efficiency of lapack is essentially inherited
    from the underlying blas, so this out-of-the-can solution
    should be ok for nearly everyone. Note that I do not consider
    this to be an argument against having a GSL interface, since
    such an interface should take care of the GSL house-keeping
    chores, marshalling the arguments in the correct way for the
    more low-level interface, and dealing with error handling.
    It would be a thin interface, but still necessary.

  - It is _always_ better to reuse a general and mature solution
    like lapack than to create a half-baked substitute, no matter
    how good your intentions or how limited the apparent scope
    of the need. Even if there is no "efficiency" gained in using
    a real lapack implementation, many things are gained. These
    include complete coverage of methods, a large suite of tests,
    a robustness that only comes from the real thing, because of
    the large existing user base, and the elimination of maintenance
    for a large chunk of specialty code.


Some historical data:

  - We discussed creating a lapack interface, similar to the
    blas interface. The basic limitation was lack of manpower
    and some statements about the difficulty of dealing with
    workspace allocation/management, which I did not understand
    at the time, and still don't understand.

  - There was some concern about evolving standards. These
    decisions were made for GSL around 1998, when even cblas
    was just a draft concept. It was not clear what was in
    the future for lapack interfaces, so we decided to
    sidestep the problem.

  - There was some concern about licensing. This is mentioned
    in the TODO file, but I do not know what the question is.
    I doubt that this can be a serious problem.


If somebody dropped a working lapack interface on the table,
I would certainly want to put it in GSL. It's a clear win.
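
To be concrete about how thin such an interface could be, here is a rough
sketch; the function name is purely illustrative, and the direct dpotrf_
call just stands in for however the underlying lapack is actually reached:

    #include <gsl/gsl_matrix.h>
    #include <gsl/gsl_errno.h>

    /* assumed Fortran LAPACK symbol; hidden string-length argument ignored */
    extern void dpotrf_(const char *uplo, const int *n,
                        double *a, const int *lda, int *info);

    /* hypothetical thin wrapper: argument checking and error handling in
       GSL style, with all of the numerics delegated to lapack */
    int
    gsl_lapack_cholesky_decomp (gsl_matrix * A)
    {
      int n, lda, info;

      if (A->size1 != A->size2)
        GSL_ERROR ("cholesky decomposition requires square matrix", GSL_ENOTSQR);

      n   = (int) A->size1;
      lda = (int) A->tda;

      /* A is symmetric, so its row-major data can be handed to the
         column-major routine as-is; asking for "U" there puts the factor
         where a row-major caller sees the lower triangle */
      dpotrf_ ("U", &n, A->data, &lda, &info);

      if (info > 0)
        GSL_ERROR ("matrix is not positive definite", GSL_EDOM);
      else if (info < 0)
        GSL_ERROR ("bad argument passed to dpotrf", GSL_EINVAL);

      return GSL_SUCCESS;
    }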

The only hard part is that we do need a reference implementation
of some sort to ship with GSL, like the gslcblas component which
ships, so that "casual" users can hit the ground running. Such
an implementation can coexist with gsl_linalg, and gsl_linalg
can be eventually deprecated.

This whole gsl_linalg thing is kind of embarrassing. People
expect to see a coherent picture for linear algebra, at the
very least a complete and functional lapack interface.
Instead, the GSL situation just smells a little fishy,
and there is a strong sociological argument for
cleaning up the mess, in order not to scare away
potential serious users.

-- 
Gerard Jungman <jungman@lanl.gov>
Los Alamos National Laboratory

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-28 23:38       ` gslclapack C J Kenneth Tan -- OptimaNumerics
@ 2006-03-29  1:20         ` Robert G. Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Robert G. Brown @ 2006-03-29  1:20 UTC (permalink / raw)
  To: C J Kenneth Tan -- OptimaNumerics
  Cc: Brian Gough, James Bergstra, gsl-discuss

On Tue, 28 Mar 2006, C J Kenneth Tan -- OptimaNumerics wrote:

> As I have mentioned above, the BLAS layer is kept the same in each of the
> benchmarks, therefore the benchmarks show the differences in LAPACK
> performance.

Well, clearly that isn't NECESSARILY true.  Especially not if the BLAS
layer is doing 80% or 90% of the actual CPU-driven work for a given
call, as it is for many calls so that one can obtain the aforementioned
factor of 2-3 speedup that many people have reported for ATLAS-tuned
BLAS based LAPACK.  If BLAS variation alone can drop timings to 30% of
their worst case values, then reducing the non-BLAS code fraction timing
to ZERO would still leave you with only 30% speedup relative to a bad
BLAS, if you understand what I'm saying.
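
To put rough numbers on that: if the BLAS accounts for 90% of the runtime
and a tuned BLAS speeds that part up threefold, the total drops from 1.0 to
0.1 + 0.9/3 = 0.4; making the remaining LAPACK-only 0.1 take zero time only
gets you from 0.4 to 0.3, roughly a further 30% at best.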

If you disagree with this, well it only makes the point more strongly --
it means that yes, LAPACK structure and BLAS structure ARE strongly
coupled so that testing LAPACK variations "holding BLAS constant" is
pointless, if you're moving the LAPACK in question into tune with the
BLAS in question and both have plenty of room for improvement.

Not that I'm disagreeing with the idea that different LAPACKs have
different performance or that yours is probably better -- only with the
idea that you can separate out LAPACK-specific performance issues from
the compiler, BLAS layer, linker and application/benchmark program by
"just" keeping some or all of them the same while varying the LAPACK
code.  Programming is way too nonlinear and coupled for that to be
valid.

I'd also be very suspicious of 1000% effects ANYWHERE in real-world
programming (and yes, very especially in LAPACK with a common BLAS
layer:-) -- that smacks of superlinear speedup or something that is true
but useless for 90% of all programmers -- or some truly terrible code,
ALGORITHMICALLY terrible and not just terrible from the point of view of
hand tuning some loops.

Not that I don't wish it were true and possible...I just never have
found it to be so except when fixing really boneheaded code... nothing
commercial grade, nothing competent programmer grade.  Although sure,
some LAPACKs may be very poor indeed here and effectively flog the cache
with their strides, just as non-ATLAS BLAS can easily do.

>> Or if you like, a longer term question that is quite worth bringing up
>> is to what extent it is desireable to introduce architecture specific
>> tuning into actual GSL code, period.
>
> Is this considered scalable from a development perspective?

I'd guess not, mostly.  But it's really up to the developers, right?

>> On the one hand, it makes the code ugly and extremely difficult to
>> maintain and offers small marginal performance gains in most places BUT
>> linear algebra.  It tends to depend heavily on compiler as well -- if
>> you write code optimized for e.g.  presumed SSE1 or SSE2 instructions
>> your compiler has to support then, not just your CPU.  It makes the code
>> in a sense less portable.  It is "expensive" in human time and developer
>> time in the GSL is probably largely donated and a precious resource.
>> I'd rather give up 20-30% in "potential" performance and use the GSL for
>> free, personally, blessing the developers and their contribution to
>> world civilization as I do so. (Thanks, guys!:-)
>>
>> On the other hand, for some applications tuning CAN be a big, big
>> performance booster -- 2-3x -- and hence very valuable.  I'm working
>> (depending on whether and how much support I get) on a portable/open way
>> of writing at least moderately autotuning HPC code.  Something that will
>> get "most" of the benefit of a full ATLAS-like autotuning build without
>> architecture-specific instrumentation.  Don't hold your breath, but in a
>> year or three maybe this problem can be at least attacked without
>> playing the library juggling game.
>
> Is this approach scalable when it comes to highly complex scientific
> code, as opposed to the comparatively more straight forward BLAS level
> code?

You're addressing research-level computer science questions, really.
From what I've seen, the GSL is already pretty decent in its algorithm
implementations as far as efficiency is concerned.  It isn't necessarily
"optimized", but it is good, competent code that won't run badly on
pretty much anything.  The ALGORITHMS are good, and are well
implemented.

To do better, an algorithm needs to be architecture aware to an extent
that few programs are, or really can be.  It needs portable "easy
access" to certain critical performance-determining metrics at runtime
initialization, so it can set e.g. stride sizes and perhaps change
algorithms depending on the measured details of the CPU/memory/IO bus
collective -- viewed as a set of sizes, latencies, bandwidths and so on.
ATLAS does what amounts to an empirical search of a multiparameter space
at build time, but in principle that space could be pre-searched --
mapped out in general terms -- and a smaller set of predictors
determined for setting parameters "close to" their optima.  You might
not get 100% of the possible speedup, but I'm guessing that you can get
80-90%, and from a conversation I had with Jack Dongarra it sounds like
this is what at least one researcher addressing the question is finding
empirically.

Note well that ATLAS/BLAS is close to the limit of algorithmical
complexity here -- most algorithms one might wish to tune are likely to
be simpler, many of them setting a single parameter for stride or the
blocking of a vector based on e.g. a measured or automatically
determined knowledge of the L1/L2 cache sizes and speeds relative to
main memory.  An example of one way GSL might be retuned for
optimization using information of this sort is in random number
generation.

Many RNGs use a very simple algorithm to actually generate each
successive number -- some additions, multiplications, moduli, done.
Many RNG users use LOTS AND LOTS of RNGs in a computation.  However, the
RNG interface in the GSL tends to generate them one at a time with a
subroutine/library call layer in between.  It might be possible, for
example, for the library to at least have a parameter that one could set
at initialization time that tells it to create a VECTOR BUFFER of rands
of some suitable length -- say a page, or several pages -- and to go
ahead and fill it on the initial call using SSE-optimized vector code.
Then simply return each successive number on demand and increment a
static pointer in the vector until it runs out and then do it again.
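
A crude sketch of the idea, layered on top of the existing gsl_rng
interface rather than inside it (the buffer length is arbitrary, and the
refill loop is where SSE-style vector generation would eventually go):

    #include <gsl/gsl_rng.h>

    #define RNG_BUF_LEN 4096

    typedef struct {
      gsl_rng *r;
      double buf[RNG_BUF_LEN];
      size_t next;                     /* index of the next unused number */
    } buffered_rng;

    static void
    buffered_rng_init (buffered_rng *b, const gsl_rng_type *T)
    {
      b->r = gsl_rng_alloc (T);
      b->next = RNG_BUF_LEN;           /* force a fill on first use */
    }

    static void
    buffered_rng_refill (buffered_rng *b)
    {
      size_t i;
      /* this loop is where vectorized generation could eventually go */
      for (i = 0; i < RNG_BUF_LEN; i++)
        b->buf[i] = gsl_rng_uniform (b->r);
      b->next = 0;
    }

    static double
    buffered_rng_get (buffered_rng *b)
    {
      if (b->next == RNG_BUF_LEN)
        buffered_rng_refill (b);
      return b->buf[b->next++];
    }

Used as buffered_rng_init(&b, gsl_rng_mt19937) and then buffered_rng_get(&b)
in the inner loop; for a MC person all that matters is that the average cost
per number goes down.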

This would likely never run more SLOWLY than single-number-per-call
code, per RNG returned, although it makes the timing of those calls
"odd".  For a MC person like myself, I could care less as long as that
average time per call goes down -- I use numbers like 10^18 of them in a
long running computation and it's my life that's a-wasting in the
meantime.

What's needed to support this sort of thing?  First, some evidence that
there is marginal benefit worth the effort in coding.  Second, you
probably wouldn't do it anywhere but the "bread and butter" RNGs -- the
ones that are the best, fastest, pass diehard, are the default or
equally good alternatives.  Why speed up any of the RNGs that are
obviously flawed and there only so people can compute and compare the
flaws?  Third, some idea that the optimization you can do is PORTABLE and
will actually improve performance on e.g. SPARC, i386, X86_64, etc with
at most small parametric tunings that can be done at runtime based on
the aforementioned metrics.

That might not be so bad or impossible to maintain, and if there were
good software support for obtaining the right metrics, they might prove
useful to ordinary numerical programmers in normal code instances --
LOTS of people do loops filled with arithmetic, few people have the
tools or energy or knowledge to be able to determine how to block out
those loops to run fast.

>> In the meantime, hey, the GSL is open source, full GPL, viral.  That
>> means that as long as the LAPACK API is preserved -- or a wrapper of the
>> standard lapack calls provided -- you can ALWAYS hack and link your own
>> LAPACK in.  The effort in any event is WAY less than the effort required
>> to actually build an architecture-dependent LAPACK in the first place,
>> and you can freely market a "tuned" version of the GSL as long as you
>> provide its sources, and it seems to me that in the case of lapack this
>> is as MOST developing a wrapper.  I've done similar things already for
>> adding e.g. rng's to the GSL -- it isn't difficult.  Or am I missing
>> something?
>
> Yes, we agree on this.  This is what I think is a good option.  It is
> crucial that LAPACK API is preserved.

Interesting discussion:-)

A lot of it is similar to parallel performance discussions on the
beowulf list.  There are easy ways to parallelize some things, but if
you really want to get serious you may need to e.g. get Ian Foster's
book (or one of the other good parallel programming books, e.g. Amalfi)
and learn about parallel algorithms.  Then you find that yeah, your
"obvious" algorithm gave you terrible speedup for reasons that in
retrospect are equally obvious.

Performance tuning in general is SO nonlinear and SO tightly coupled to
architecture, compiler choice, library set.  Then there is algorithmic
robustness to consider (don't want to get fast, incorrect answers as
legend has it one can easily do with Cell processors if you aren't
careful).  Most real numerical code I personally have written has been a
compromise, with efficiency subjugated to portability, long term
maintainability, and not getting wrong answers to the extent I can
control it with numerical error to contend with.  I think that these are
sensible design criteria for the GSL as well, where of course one would
hope to use efficient algorithms wherever possible within these
constraints.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-28 13:43     ` gslclapack Robert G. Brown
  2006-03-28 23:38       ` gslclapack C J Kenneth Tan -- OptimaNumerics
@ 2006-03-29  1:57       ` James Bergstra
  1 sibling, 0 replies; 24+ messages in thread
From: James Bergstra @ 2006-03-29  1:57 UTC (permalink / raw)
  To: gsl-discuss

> >>> On one hand, I'm jealous that lapack has a fancy routine, and react 
> >>"let's
> >>> implement that for gsl too" and on the other hand, I think "why is gsl
> >>> re-implementing (what I believe is) a very standard library?"
> >>
> >>LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
> >>a dependency on lapack would be a pain.
> >
> >How do you find LAPACK being a pain?  The public version of LAPACK has
> >been well tested and the design has been well thought out.  It has a
> >very comprehensive test suite also.
> 
> Is this an either/or proposition?  Having its own lapack implementation
> just eliminates a dependency which I agree is desireable, and since the
> system-provided LAPACK often sucks as far as efficiency goes it also
> means that one CAN gradually start to tune up gsllapack. If the calls
> are the same, though, presumably one could link into an alternative
> library via suitable compile/link flags.  I thought this was the way
> things were being done already, actually, so one could e.g. use ATLAS
> BLAS:

My hope was to encourage interchangeable implementations.

But it seems lapack routines cannot be used as drop-in replacements to implement
linalg_xxxx ones on account of the row-major, col-major business.

Is this true?

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-29  0:33   ` gslclapack Gerard Jungman
@ 2006-03-29 17:45     ` Linas Vepstas
  2006-03-29 18:04       ` gslclapack J.J.Green
  2006-03-29 23:04       ` gslclapack Gerard Jungman
  2006-03-31 23:19     ` gslclapack Brian Gough
  1 sibling, 2 replies; 24+ messages in thread
From: Linas Vepstas @ 2006-03-29 17:45 UTC (permalink / raw)
  To: Gerard Jungman; +Cc: gsl-discuss


I found this conversation a tad too abstract.

On Tue, Mar 28, 2006 at 05:30:07PM -0700, Gerard Jungman wrote:
>     I am hoping that the clapack standard
>     says something about this situation; after all, it
>     ought to be designed to work well with cblas.

I don't understand. A google of clapack reveals things like
http://www.netlib.org/clapack/

which seems no better than calling the fortran libs directly 
from c code. And the problem with this is, of course, that 

-- the fortran subroutine names are a bit too short and cryptic
-- the fortran calling convention is rather bizarre for C
   (see e.g. the ick at http://www.netlib.org/clapack/clapack.h)
-- the only decent documentation for the actual subroutine 
   arguments seems to exist only in the fortran code, making 
   it hard to read. :-( 

These seem to be the major problems, right?

>   - We discussed creating a lapack interface, similar to the
>     blas interface. The basic limitation was lack of manpower
>     and some statements about the difficulty of dealing with
>     workspace allocation/management, which I did not understood
>     at the time, and still don't understand.

Right. This is part of what makes the existing fortran interfaces
nasty to use.

> If somebody dropped a working lapack interface on the table,
> I would certainly want to put it in GSL. It's a clear win.

I presume you mean "a C wrapper that solves the strange
calling convention, documentation and workspace problems"?

--linas

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-29 17:45     ` gslclapack Linas Vepstas
@ 2006-03-29 18:04       ` J.J.Green
  2006-03-29 18:51         ` gslclapack James Bergstra
  2006-03-29 23:04       ` gslclapack Gerard Jungman
  1 sibling, 1 reply; 24+ messages in thread
From: J.J.Green @ 2006-03-29 18:04 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Gerard Jungman, gsl-discuss

Hi

> which seems no better than calling the fortran libs directly
> from c code. And the problem with this is, of course, that
   [snip]
> These seem to be the major probblems, right?

I think the problem with calling fortran from C is that
it is compiler, OS and (I think) architecture dependent.
CLAPACK solves the portability problem, but as you have
noticed, not the arcane interface, which puts most people
off using it.

Cheers

Jim
-- 
J.J. Green, Dept. Applied Mathematics, Hicks Bld.,
University of Sheffield, UK.   +44 (0114) 222 3742
http://pdfb.wiredworkplace.net/pub/jjg


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-29 18:04       ` gslclapack J.J.Green
@ 2006-03-29 18:51         ` James Bergstra
  2006-03-30 16:44           ` [row/column majority] gslclapack Z F
  0 siblings, 1 reply; 24+ messages in thread
From: James Bergstra @ 2006-03-29 18:51 UTC (permalink / raw)
  To: gsl-discuss

On Wed, Mar 29, 2006 at 07:04:39PM +0100, J.J.Green wrote:
> I think the problem with calling fortran from C is that
> it is compiler, OS and (I think) architecture dependant.
> CLAPACK solves the portability problem, but as you have
> noticed, not the arcane interface which puts most poeple
> off using it.

Exactly the motivation for a gsl wrapper (or extension)!  A
gsl module would serve both as a gentle introduction to
using lapack, as well as a hook to place normal
documentation.

It seems to me that the largest barrier to a lapack wrapper
is that gsl_matrix is an implicitly row-major structure, so
that in general, arguments must be out-of-place transposed
before a lapack call.  Although this is not significant in
terms of either O(n) time or space,  the overhead is more
noticeable in the case of large numbers of calls with small
arguments.

Anyway, it seems like the consensus is that lapack is
annoying to use but good in theory, once it's working.  If
someone (e.g. me) writes an extension that makes at least
part of lapack less annoying to use in practice, should it
be shared with the list?


James

PS.  My extension would probably introduce a column-major
version of gsl_matrix.

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-29 17:45     ` gslclapack Linas Vepstas
  2006-03-29 18:04       ` gslclapack J.J.Green
@ 2006-03-29 23:04       ` Gerard Jungman
  1 sibling, 0 replies; 24+ messages in thread
From: Gerard Jungman @ 2006-03-29 23:04 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: gsl-discuss

On Wed, 2006-03-29 at 11:44 -0600, Linas Vepstas wrote:
> I found this conversation a tad too abstract.
> 
> On Tue, Mar 28, 2006 at 05:30:07PM -0700, Gerard Jungman wrote:
> >     I am hoping that the clapack standard
> >     says something about this situation; after all, it
> >     ought to be designed to work well with cblas.
> 
> I don't udnerstand. A google of clapack reveals things like 
> http://www.netlib.org/clapack/
> 
> which seems no better than calling the fortran libs directly 
> from c code. And the problem with this is, of course, that 

Sorry. I was not precise enough. There is a thing called "clapack",
which is disgusting (I think it is basically f2c run on the lapack
source, followed by twiddling). What I mean by a "C lapack" is
whatever the new development lapack cycle is going to produce.
Currently, Dongarra et al are supposed to be working on improved
versions of lapack and sca-lapack. This new work includes
standardized wrappers for a set of languages, like C, python, f95, etc.
They propose to implement in f77, as before, which is understandable.

The (funded) proposal is available at
http://www.cs.berkeley.edu/~demmel/Sca-LAPACK-Proposal.pdf

The announcement related to this can be found by 
searching for "New Release of LAPACK and ScaLAPACK planned"
on google-groups.

They state explicitly that they want to use "better interfaces
using features of up-to-date programming languages" to "hide
some of the complex distributed data structures and other features".
These are exactly the kind of interface improvements we want,
so it would be great if they could sort it all out and
produce something attractive.


> > If somebody dropped a working lapack interface on the table,
> > I would certainly want to put it in GSL. It's a clear win.
> 
> I presume you mean "a C wrapper that solves the strange 
> calling convention, documentiation and workspace problems". ?

Exactly, as per the above.

Linas, thanks for the chance to clarify what I meant!

-- 
Gerard Jungman <jungman@lanl.gov>
Los Alamos National Laboratory

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-29 18:51         ` gslclapack James Bergstra
@ 2006-03-30 16:44           ` Z F
  2006-03-30 19:59             ` Robert G. Brown
  0 siblings, 1 reply; 24+ messages in thread
From: Z F @ 2006-03-30 16:44 UTC (permalink / raw)
  To: James Bergstra, gsl-discuss


Row/column majority question comes from the times when
God was REAL and Infinity was INTEGER (FORTRAN). But even
at that time, some people, I hope, knew that the question of 
row/column majority is irrelevant. Think about it!

So think.... Sorry, but I will use capital letters...

COLUMN-MAJOR or ROW-MAJOR ??? IT IS ALL IN YOUR HEAD!!!!
IT IS NOT REAL AND DOES NOT EXIST!!! THE MATRIX IS STORED
AS A CONTIGUOUS OBJECT AND ADDRESSING IS DONE VIA A FORMULA
LIKE THIS

index = j*Ni+i!!!

SO GIVE THE MAJORITY A REST!!!

Once again, sorry if I offended anyone, but someone had to point it
out.


Lazar

--- James Bergstra <james.bergstra@umontreal.ca> wrote:

> On Wed, Mar 29, 2006 at 07:04:39PM +0100, J.J.Green wrote:
> > I think the problem with calling fortran from C is that
> > it is compiler, OS and (I think) architecture dependant.
> > CLAPACK solves the portability problem, but as you have
> > noticed, not the arcane interface which puts most poeple
> > off using it.
> 
> Exactly the motivation for a gsl wrapper (or extension)!  A
> gsl module would serve both as a gentle introduction to
> using lapack, as well as a hook to place normal
> documentation.
> 
> It seems to me that the largest barrier to a lapack wrapper
> is that gsl_matrix is an implicitly row-major structure, so
> that in general, arguments must be out-of-place transposed
> before a lapack call.  Although this is not significant in
> terms of either O(n) time or space,  the overhead is more
> noticeable in the case of large numbers of calls with small
> arguments.
> 
> Anyway, it seems like the concensus is that lapack is
> annoying to use but good in theory, once it's working.  If
> someone (eg. me) writes an extension that makes (at least
> part of lapack) less annoying to use in practice, share with
> the list?
> 
> 
> James
> 
> PS.  My extension would probably introduce a column-major
> version of gsl_matrix.
> 
> -- 
> james bergstra
> http://www-etud.iro.umontreal.ca/~bergstrj
> 
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-30 16:44           ` [row/column majority] gslclapack Z F
@ 2006-03-30 19:59             ` Robert G. Brown
  2006-03-30 21:04               ` Z F
  2006-03-30 22:24               ` James Bergstra
  0 siblings, 2 replies; 24+ messages in thread
From: Robert G. Brown @ 2006-03-30 19:59 UTC (permalink / raw)
  To: Z F; +Cc: James Bergstra, gsl-discuss

On Thu, 30 Mar 2006, Z F wrote:

>
> Row/column majority question comes from the times when
> God was REAL and Infinity was INTEGER (FORTRAN). But even
> at that time, some people, I hope, knew that the question of
> row/column majority is irrelevant. Think about it!
>
> So think.... Sorry, but I will use capital letters...
>
> COLUMN-MAJOR or ROW-MAJOR ??? IT IS ALL IN YOUR HEAD!!!!
> IT IS NOT REAL AND DOES NOT EXIST!!! THE MATRIX IS STORED
> AS A CONTINUOUS OBJECT AND ADDRESSING IS DONE VIA a FORUMLA
> LIKE THIS
>
> index = j*Ni+i!!!
>
> SO GIVE THE MAJORITY A REST!!!
>
> Once again, sorry if I offended anyone, but someone had to point it
> out.

Yes, but CPUs really HATE doing arithmetic with a string of objects
separated by a large offset.  They much prefer to do arithmetic with a
string of objects with unit offset.  That way caches, prefetch and so on
work.  They indicate their disapproval by running really really slowly.

In other words, it is good if "j" isn't the rapidly varying index in
your example of offset arithmetic.
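
To make that concrete with your index = j*Ni + i layout: the two loops
below do the same arithmetic on the same buffer, but the first walks it at
unit stride and the second at stride Ni, and on anything much bigger than
cache the second is typically several times slower (the sizes are purely
illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    #define NI 2048
    #define NJ 2048

    int main(void)
    {
      double *a = malloc ((size_t) NI * NJ * sizeof (double));
      double sum = 0.0;
      int i, j;

      for (i = 0; i < NI * NJ; i++) a[i] = 1.0;

      /* unit stride: i, the fast index of the index = j*NI + i layout,
         is the inner loop */
      for (j = 0; j < NJ; j++)
        for (i = 0; i < NI; i++)
          sum += a[j * NI + i];

      /* stride NI: same arithmetic, same result, much worse cache use */
      for (i = 0; i < NI; i++)
        for (j = 0; j < NJ; j++)
          sum += a[j * NI + i];

      printf ("%g\n", sum);
      free (a);
      return 0;
    }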

    rgb

>
>
> Lazar
>
> --- James Bergstra <james.bergstra@umontreal.ca> wrote:
>
>> On Wed, Mar 29, 2006 at 07:04:39PM +0100, J.J.Green wrote:
>>> I think the problem with calling fortran from C is that
>>> it is compiler, OS and (I think) architecture dependant.
>>> CLAPACK solves the portability problem, but as you have
>>> noticed, not the arcane interface which puts most poeple
>>> off using it.
>>
>> Exactly the motivation for a gsl wrapper (or extension)!  A
>> gsl module would serve both as a gentle introduction to
>> using lapack, as well as a hook to place normal
>> documentation.
>>
>> It seems to me that the largest barrier to a lapack wrapper
>> is that gsl_matrix is an implicitly row-major structure, so
>> that in general, arguments must be out-of-place transposed
>> before a lapack call.  Although this is not significant in
>> terms of either O(n) time or space,  the overhead is more
>> noticeable in the case of large numbers of calls with small
>> arguments.
>>
>> Anyway, it seems like the concensus is that lapack is
>> annoying to use but good in theory, once it's working.  If
>> someone (eg. me) writes an extension that makes (at least
>> part of lapack) less annoying to use in practice, share with
>> the list?
>>
>>
>> James
>>
>> PS.  My extension would probably introduce a column-major
>> version of gsl_matrix.
>>
>> --
>> james bergstra
>> http://www-etud.iro.umontreal.ca/~bergstrj
>>
>>
>
>
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-30 19:59             ` Robert G. Brown
@ 2006-03-30 21:04               ` Z F
  2006-04-01 21:48                 ` Robert G. Brown
  2006-03-30 22:24               ` James Bergstra
  1 sibling, 1 reply; 24+ messages in thread
From: Z F @ 2006-03-30 21:04 UTC (permalink / raw)
  To: Robert G. Brown; +Cc: James Bergstra, gsl-discuss



--- "Robert G. Brown" <rgb@phy.duke.edu> wrote:

> On Thu, 30 Mar 2006, Z F wrote:
> 
> >
> > Row/column majority question comes from the times when
> > God was REAL and Infinity was INTEGER (FORTRAN). But even
> > at that time, some people, I hope, knew that the question of
> > row/column majority is irrelevant. Think about it!
> >
> > So think.... Sorry, but I will use capital letters...
> >
> > COLUMN-MAJOR or ROW-MAJOR ??? IT IS ALL IN YOUR HEAD!!!!
> > IT IS NOT REAL AND DOES NOT EXIST!!! THE MATRIX IS STORED
> > AS A CONTINUOUS OBJECT AND ADDRESSING IS DONE VIA a FORUMLA
> > LIKE THIS
> >
> > index = j*Ni+i!!!
> >
> > SO GIVE THE MAJORITY A REST!!!
> >
> > Once again, sorry if I offended anyone, but someone had to point it
> > out.
> 
> Yes, but CPUs really HATE doing arithmetic with a string of objects
> separated by a large offset.  They much prefer to do arithmetic with
> a
> string of objects with unit offset.  That way caches, prefetch and so
> on
> work.  They indicate their disapproval by running really really
> slowly.

I agree 100% with you here. This is where hardware has to be
taken into account when optimizations are done.


> In other words, it is good if "j" isn't the rapidly varying index in
> your example of offset arithmetic.

Sorry, I fail to see how this is related to the problem of
the majority. In my example 'i' was the rapidly changing index by
definition. If you want to see 'j' as the rapidly changing one, I 
would write:

index = i*Nj+j

but what does it change?

The difference is how a[i,j] are interpreted 

a[i,j] => a[i*Nj+j]
a[i,j] => a[j*Ni+i]

one is called row major, the other column major (sorry, I do
not know which is which), but nothing has changed, as long as
the access is done via this indexing.

Can someone tell me why one is faster than the other? Or more
convenient?

Thanks

Lazar


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-30 19:59             ` Robert G. Brown
  2006-03-30 21:04               ` Z F
@ 2006-03-30 22:24               ` James Bergstra
  2006-03-30 23:01                 ` Z F
  2006-03-31  0:09                 ` Linas Vepstas
  1 sibling, 2 replies; 24+ messages in thread
From: James Bergstra @ 2006-03-30 22:24 UTC (permalink / raw)
  To: gsl-discuss

> On Thu, 30 Mar 2006, Z F wrote:
> >COLUMN-MAJOR or ROW-MAJOR ??? IT IS ALL IN YOUR HEAD!!!!
> >IT IS NOT REAL AND DOES NOT EXIST!!! THE MATRIX IS STORED
> >AS A CONTINUOUS OBJECT AND ADDRESSING IS DONE VIA a FORUMLA
> >LIKE THIS

Can someone explain why we have CblasRowMajor and CblasColMajor?

I just did a little test to convince myself, and indeed (with
-lgslcblas) the two following calls seem to produce the same answers
in the variable c.

    cblas_dgemm(CblasRowMajor, T2, T1, ...  a, b, c)
    cblas_dgemm(CblasColMajor, T2, T1, ...  b, a, c)

T1 and T2 are each one of CblasNoTrans and CblasTrans.

Did I make a mistake?  Are there other BLAS functions for which you
need to specify the data format for things to work?
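
The test was roughly along these lines, simplified here to square 2x2
matrices with NoTrans (the real test also cycled through the transpose
flags); it links with -lgslcblas:

    #include <stdio.h>
    #include <gsl/gsl_cblas.h>

    int main(void)
    {
      /* 2x2 matrices, stored row-major */
      double a[] = { 1, 2,
                     3, 4 };
      double b[] = { 5, 6,
                     7, 8 };
      double c1[4] = { 0 }, c2[4] = { 0 };

      /* c1 = a*b with the buffers declared row-major */
      cblas_dgemm (CblasRowMajor, CblasNoTrans, CblasNoTrans,
                   2, 2, 2, 1.0, a, 2, b, 2, 0.0, c1, 2);

      /* same buffers declared column-major, operands swapped: the
         column-major view of a row-major array is its transpose, and
         (A B)^T = B^T A^T, so c2 holds exactly the same numbers as c1 */
      cblas_dgemm (CblasColMajor, CblasNoTrans, CblasNoTrans,
                   2, 2, 2, 1.0, b, 2, a, 2, 0.0, c2, 2);

      printf ("%g %g %g %g\n", c1[0], c1[1], c1[2], c1[3]);   /* 19 22 43 50 */
      printf ("%g %g %g %g\n", c2[0], c2[1], c2[2], c2[3]);   /* 19 22 43 50 */
      return 0;
    }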

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-30 22:24               ` James Bergstra
@ 2006-03-30 23:01                 ` Z F
  2006-03-31  0:09                 ` Linas Vepstas
  1 sibling, 0 replies; 24+ messages in thread
From: Z F @ 2006-03-30 23:01 UTC (permalink / raw)
  To: James Bergstra, gsl-discuss

Using RowMajor instead of ColMajor is like a matrix transpose operation,

so A*B = (B^T * A^T)^T.
That is why the answers are the same. 
Why the switch is needed? I do not know, but can speculate...
I suspect it is because fortran 77 cannot allocate memory,
and to facilitate memory management, Col/Row Major and transpose might
be different operations if the size of matrix is smaller than the
allocated memory.

Lazar

> Can someone explain why we have CblasRowMajor and CblasColMajor?
> 
> I just did a little test to convince myself, and indeed (with
> -lgslcblas) the two following calls seem to produce the same answers
> in the variable c.
> 
>     cblas_dgemm(CblasRowMajor, T2, T1, ...  a, b, c)
>     cblas_dgemm(CblasColMajor, T2, T1, ...  b, a, c)
> 
> T1 and T2 are each one of CblasNoTrans and CblasTrans.
> 
> Did I make a mistake?  Are there other BLAS functions for which you
> need to specify the data format for things to work?
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-30 22:24               ` James Bergstra
  2006-03-30 23:01                 ` Z F
@ 2006-03-31  0:09                 ` Linas Vepstas
  2006-03-31 17:45                   ` James Bergstra
  1 sibling, 1 reply; 24+ messages in thread
From: Linas Vepstas @ 2006-03-31  0:09 UTC (permalink / raw)
  To: James Bergstra; +Cc: gsl-discuss

On Thu, Mar 30, 2006 at 05:24:25PM -0500, James Bergstra wrote:
> 
> Can someone explain why we have CblasRowMajor and CblasColMajor?
> 
> I just did a little test to convince myself, and indeed (with
> -lgslcblas) the two following calls seem to produce the same answers
> in the variable c.
> 
>     cblas_dgemm(CblasRowMajor, T2, T1, ...  a, b, c)
>     cblas_dgemm(CblasColMajor, T2, T1, ...  b, a, c)
> 
> T1 and T2 are each one of CblasNoTrans and CblasTrans.
> 
> Did I make a mistake?  Are there other BLAS functions for which you
> need to specify the data format for things to work?

For matrices that are larger than the native L1, L2, and L3 caches 
on your machine (typically some dozens of KB, 1 MB and 4-16MB
respectively), you should see (possibly very large) performance
differences. Performance depends very strongly on memory access
patterns. 

--linas

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-31  0:09                 ` Linas Vepstas
@ 2006-03-31 17:45                   ` James Bergstra
  2006-03-31 18:10                     ` Linas Vepstas
  2006-03-31 23:30                     ` Brian Gough
  0 siblings, 2 replies; 24+ messages in thread
From: James Bergstra @ 2006-03-31 17:45 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: James Bergstra, gsl-discuss

> On Thu, Mar 30, 2006 at 05:24:25PM -0500, James Bergstra wrote:
> > 
> > Can someone explain why we have CblasRowMajor and CblasColMajor?
> > 
> > I just did a little test to convince myself, and indeed (with
> > -lgslcblas) the two following calls seem to produce the same answers
> > in the variable c.
> > 
> >     cblas_dgemm(CblasRowMajor, T2, T1, ...  a, b, c)
> >     cblas_dgemm(CblasColMajor, T2, T1, ...  b, a, c)
> > 
> > T1 and T2 are each one of CblasNoTrans and CblasTrans.
> > 
> > Did I make a mistake?  Are there other BLAS functions for which you
> > need to specify the data format for things to work?
> 
On Thu, Mar 30, 2006 at 06:09:09PM -0600, Linas Vepstas wrote:
> For matrices that are larger than the native L1, L2, and L3 caches 
> on your machine (typically some dozens of KB, 1 MB and 4-16MB
> respectively), you should see (possibly very large) performance
> differences. Performance depends very strongly on memory access
> patterns. 

This answer doesn't make sense to me, because in my example the memory access
patterns are identical.  I'm multiplying the same physical matrices together,
and getting the same physical result.  If one call is slower than the other then
I would think it's a bad implementation.

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-31 17:45                   ` James Bergstra
@ 2006-03-31 18:10                     ` Linas Vepstas
  2006-03-31 23:30                     ` Brian Gough
  1 sibling, 0 replies; 24+ messages in thread
From: Linas Vepstas @ 2006-03-31 18:10 UTC (permalink / raw)
  To: James Bergstra; +Cc: gsl-discuss

On Fri, Mar 31, 2006 at 12:45:14PM -0500, James Bergstra wrote:
> 
> This answer doesn't make sense to me, because in my example the memory access
> patterns are identical.  


If the memory access patterns are identical, then I guess I don't get
the point?  

--linas

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-29  0:33   ` gslclapack Gerard Jungman
  2006-03-29 17:45     ` gslclapack Linas Vepstas
@ 2006-03-31 23:19     ` Brian Gough
  2006-04-03 19:47       ` gslclapack Gerard Jungman
  1 sibling, 1 reply; 24+ messages in thread
From: Brian Gough @ 2006-03-31 23:19 UTC (permalink / raw)
  To: gsl-discuss

Gerard Jungman writes:
 >   - We discussed creating a lapack interface, similar to the
 >     blas interface. The basic limitation was lack of manpower
 >     and some statements about the difficulty of dealing with
 >     workspace allocation/management, which I did not understood
 >     at the time, and still don't understand.

The basic problem with a LAPACK interface is there is no clean way to
do it.  I've tried this several times.  As we remember, with BLAS you
have the 'CBLAS trick' to get the row-major versions of Fortran BLAS
by using the BLAS transpose argument, which is conveniently provided
in all necessary functions.

The LAPACK functions typically don't have a transpose argument and are
hard-coded for column-major.  This means either you have to use
column-major in C (which is inconvenient if you have other C-style
matrices around) or you need to provide transposed matrices, either by
logically solving the transposed system or physically transposing them
before each call and then transposing them back (which is what Numpy
does).

If the matrix has M=N or N=TDA then you can do the transpose in-place,
so you don't need extra memory.  But in the general case M!=N, N!=TDA
you have to allocate a copy of the matrix.  Obviously for large
matrices this is wasteful of memory, but these are the sorts of
systems you would want to solve.  
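In GSL terms a sketch of the two cases might look like this; the
gsl_matrix_transpose and gsl_matrix_transpose_memcpy calls are existing
GSL functions, the helper itself is hypothetical, and only the square
case is handled in place here (there is no ready-made GSL call for the
in-place rectangular N=TDA transpose):

   #include <gsl/gsl_matrix.h>

   /* Square (M == N): transpose in place, no extra storage.
      Otherwise: allocate and fill a transposed copy, which the
      caller must gsl_matrix_free() when done. */
   gsl_matrix *as_transposed(gsl_matrix *m)
   {
     if (m->size1 == m->size2) {
       gsl_matrix_transpose(m);              /* in place, square only */
       return m;
     } else {
       gsl_matrix *t = gsl_matrix_alloc(m->size2, m->size1);
       gsl_matrix_transpose_memcpy(t, m);    /* t = m^T */
       return t;
     }
   }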

Already you are looking at 3 possible types of wrapper function for
each LAPACK function to handle the different cases.  But it's more
complicated than that because for functions with multiple arguments
you might have different cases for each argument.  

So either there has to be some radical simplifying assumption like
"always allocate memory and transpose" or "only handle matrices with
N=TDA" or "make the user pass everything as column-major". But people
will surely complain whichever is chosen.  

Until LAPACK has transpose arguments to make a 'LAPACK trick' possible
it's a lost cause, and it's probably easier for people to call LAPACK
directly on a case-by-case basis.

-- 
Brian Gough

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-31 17:45                   ` James Bergstra
  2006-03-31 18:10                     ` Linas Vepstas
@ 2006-03-31 23:30                     ` Brian Gough
  1 sibling, 0 replies; 24+ messages in thread
From: Brian Gough @ 2006-03-31 23:30 UTC (permalink / raw)
  To: gsl-discuss

James Bergstra writes:
 > > On Thu, Mar 30, 2006 at 05:24:25PM -0500, James Bergstra wrote:
 > > > 
 > > > Can someone explain why we have CblasRowMajor and CblasColMajor?

The cblas routines have internal branches which optimise for locality
in the rows or columns as appropriate.  These are equivalent to calls
to the Fortran blas using transposes.  This is the 'cblas trick' which
allows cblas to be written in terms of an underlying fortran
blas. There is a paper about it, but I don't have the reference to hand.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [row/column majority] gslclapack
  2006-03-30 21:04               ` Z F
@ 2006-04-01 21:48                 ` Robert G. Brown
  0 siblings, 0 replies; 24+ messages in thread
From: Robert G. Brown @ 2006-04-01 21:48 UTC (permalink / raw)
  To: Z F; +Cc: James Bergstra, gsl-discuss

On Thu, 30 Mar 2006, Z F wrote:

> The difference is how a[i,j] are interpreted
>
> a[i,j] => a[i*Nj+j]
> a[i,j] => a[j*Ni+i]
>
> one is called the row major, the other is the column major (sorry do
> not know which is which), but
> nothing changed. As long as the access is done via this indexing.
>
> Can someone tell me why one is faster than the other? Or more
> convenient?

Sorry, I was out of town for a few days.  In case nobody else answered,
"to make sure that they match in your algorithms".

In C they are referenced a[i][j]...[m][n], usually after being defined
and allocated via something like (e.g., and I'm not checking my code
carefully here so I may have an error or two):

   int i,j;
   double **a;
   a = (double **) malloc (imax*sizeof(double *));    /* array of row pointers */
   for(i=0;i<imax;i++) {
     a[i] = (double *) malloc (jmax*sizeof(double));  /* one block per row */
     for(j=0;j<jmax;j++) a[i][j] = i*jmax + j;
   }

To free a it is NOT ENOUGH to just free a -- one has to go through and
free all the blocks whose address is stored in a[i], THEN free a.
Otherwise you leak memory.  You iterate the process to do ***a or ****a
(three or four dimensional double precision tensors).  You could also
define or use a complex or quaternionic or whatever struct and malloc
the highest index times sizeof the struct -- the code is quite flexible
wrt data typing and objects.
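For instance, a matching cleanup routine for the allocation above
(free_matrix is just an illustrative helper name):

   #include <stdlib.h>

   /* Free a matrix allocated row-by-row as above: first every row
      block a[i], then the array of row pointers itself. */
   void free_matrix(double **a, int imax)
   {
     int i;
     for (i = 0; i < imax; i++) free(a[i]);
     free(a);
   }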

This is common enough, but not necessarily very good code.  Two things
that one can do to make it "better" code are a) add or subtract an
offset so that the indices don't have to start at 0 and run to imax-1.
Fortran indices by default run from 1 to imax.  b) Allocate the entire
vector of memory as a block.  In the code above, if it STAYS on the CPU
you are LIKELY enough to get a contiguous block of memory from the
sequential mallocs, but you are not guaranteed to -- it is up to the
kernel.  If the system is multitasking and it happens to give another
job a timeslice in the middle of the loop of mallocs, it is somewhat
LIKELY to introduce an arbitrarily large offset in the very middle of
what you "think" is a single contiguous block of memory.  Needless to
say, this will make all sorts of matrix code unhappy every time your
loop addressing invisibly crosses the boundary.  Some mallocs/compilers
MIGHT recognize this at run time and work with the kernel to free up a
contiguous block, but I wouldn't count on it.

The offset part is trivial.  The contiguous block part is managed by
introducing one more pointer:

   int i,j;
   double **a;
   double *adata;
   adata = (double *) malloc(imax*jmax*sizeof(double));
   /*
    * adata is now a CONTIGUOUS block of memory long enough to hold the
    * entire matrix.  We now have to pack the ADDRESSES of the appropriate
    * offsets into **a.
    */
   a = (double **) malloc (imax*sizeof(double *));
   for(i=0;i<imax;i++) {
     a[i] = &adata[jmax*i];      /* row i starts jmax*i doubles into adata */
     for(j=0;j<jmax;j++) a[i][j] = i*jmax + j;
   }

Where I'm being sloppy about casts (arg of malloc should be size_t, for
example IIRC).  For a 2x2 matrix this should result in:
   (simulated memory layout as address:(type) or address:contents)
adata = [0000:(double)][0008:(double)][0010:(double)][0018:(double)]
a = [0020:(double *)][0024:(double *)]

Then

a = [0020:0000][0024:0010]

(a[i] now contains the addresses 0000 and 0010 in adata)

and finally

adata = [0000:(double)0][0008:(double)1][0010:(double)2][0018:(double)3]

where (note well) we ADDRESSED the vector via a[i][j]!  This saves one
from messing with offset arithmetic.  Note again that to free a without
leaks you have to both free a and free adata, although you can do lots
of other neato stuff with this idea -- maintain different "windows" into
adata, free a WITHOUT freeing adata.  Some of which is of course very
dangerous coding practice:-)

As to why one has to be careful -- in your actual code you want to be
sure to try to do operations like:

  for(i = 0;i<imax;i++){
    b[i] = 0.0;
    for(j = 0;j<jmax;j++){
      b[i] += a[i][j]*c[j];
    }
  }

and try NOT to do:

  for(i = 0;i<imax;i++){
    b[i] = 0.0;
    for(j = 0;j<jmax;j++){
      b[i] += a[j][i]*c[j];
    }
  }

It may seem silly, but you want this to match up between your C
assumptions in allocating and laying out **a and any fortran-derived
routines for doing linear algebra.  Consequently you want to make sure
that fortran lays out a(i,j) the same way that you've laid out a[i][j].
Obviously that means "at least" unit offsets and strictly rectangular
matrices, but fortran doesn't (IIRC) actually SPECIFY which one of these
two indices is the local/contiguous one.  However, I could be wrong as I
haven't used fortran -- by choice -- for close to twenty years...;-)
Also I think that fortran has even more restrictions or constraints when
trying to use fortran routines in C code -- most of the times I've tried
this I've given up before figuring it out for sure because it is a PITA.

IMO, the ability to do malloc/pointer magic like this is both one of C's
real strengths and one of its "numerical" weaknesses.  It is a strength
because you can do some very cool things and retain a lot of control
over just what the system is doing with memory and in just what order --
it is left as an exercise how to (for example) allocate a single vector
from which one can pack addresses into a triangular y[l][m] complex
array indexed by l \in 0,lmax, m \in -l,l (angular momentum indices) with
no wasted space and fully automatic indexing (no actual computation of
offsets except during the original allocation/packing -- see the sketch
below).  Or to pack up a
triangular array of pretty much arbitrary structs.  Also, the vector
adata can be passed quite freely to e.g. ODE solvers and you can still
use array forms (no offset arithmetic) in the derivative evaluators by
passing in the address of **a as one of the ODE parameters.
a[i][j][k][l][m][n] REALLY beats the hell out of trying to do offset
arithmetic with explicit e.g. imax, ioffset, jmax, joffset, kmax,
koffset...nmax, noffset (and the indices) for a six dimensional array.
Compilers usually do a[][]... offset arithmetic "fast" as well, but
have to be told to make offsets computed the hard way equally fast.
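A hedged sketch of the triangular y[l][m] packing left as an exercise
above; the cplx struct and the alloc_ylm name are placeholders, not GSL
types:

   #include <stdlib.h>

   typedef struct { double re, im; } cplx;   /* stand-in complex type */

   /* One contiguous block of (lmax+1)^2 entries; row l starts at
      offset l*l and the row pointer is shifted by +l so that y[l][m]
      is valid for l = 0..lmax, m = -l..l. */
   cplx **alloc_ylm(int lmax)
   {
     int l;
     cplx *data = (cplx *) malloc((lmax+1)*(lmax+1)*sizeof(cplx));
     cplx **y   = (cplx **) malloc((lmax+1)*sizeof(cplx *));
     for (l = 0; l <= lmax; l++)
       y[l] = data + l*l + l;
     return y;
   }

   /* y[0] == data, so free(y[0]); free(y); releases everything. */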

It is a weakness because the compiler cannot KNOW how to optimize
a[i][j] based operations in terms of e.g. loop unrolling and using
registers the way it can in fortran.  Fortran's rigid structure and
memory allocation permits compiler optimizations that are usually
somewhat stronger than those that can be used by a c compiler.

Anyway, this is how I THINK it goes.  I'm always happy to be corrected
by a real expert.

    rgb

>
> Thanks
>
> Lazar
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: gslclapack
  2006-03-31 23:19     ` gslclapack Brian Gough
@ 2006-04-03 19:47       ` Gerard Jungman
  0 siblings, 0 replies; 24+ messages in thread
From: Gerard Jungman @ 2006-04-03 19:47 UTC (permalink / raw)
  To: gsl-discuss

On Sat, 2006-04-01 at 00:19 +0100, Brian Gough wrote:
> 
> Until LAPACK has transpose arguments to make a 'LAPACK trick' possible
> it's a lost cause and is probably easier for people call LAPACK
> directly on a case-by-case basis.

I am hoping that the new work will address this problem. Check the
message I sent earlier with the link to the 2004 Dongarra et al.
proposal. I'm sure you've seen this; I think we talked about it.

We need to figure out what they are doing, and act accordingly.
They may be solving all these problems.

-- 
Gerard Jungman <jungman@lanl.gov>
Los Alamos National Laboratory

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2006-04-03 19:47 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-26  2:41 gslclapack James Bergstra
2006-03-28 11:48 ` gslclapack Brian Gough
2006-03-28 12:14   ` gslclapack C J Kenneth Tan -- OptimaNumerics
2006-03-28 13:43     ` gslclapack Robert G. Brown
2006-03-28 23:38       ` gslclapack C J Kenneth Tan -- OptimaNumerics
2006-03-29  1:20         ` gslclapack Robert G. Brown
2006-03-29  1:57       ` gslclapack James Bergstra
2006-03-29  0:33   ` gslclapack Gerard Jungman
2006-03-29 17:45     ` gslclapack Linas Vepstas
2006-03-29 18:04       ` gslclapack J.J.Green
2006-03-29 18:51         ` gslclapack James Bergstra
2006-03-30 16:44           ` [row/column majority] gslclapack Z F
2006-03-30 19:59             ` Robert G. Brown
2006-03-30 21:04               ` Z F
2006-04-01 21:48                 ` Robert G. Brown
2006-03-30 22:24               ` James Bergstra
2006-03-30 23:01                 ` Z F
2006-03-31  0:09                 ` Linas Vepstas
2006-03-31 17:45                   ` James Bergstra
2006-03-31 18:10                     ` Linas Vepstas
2006-03-31 23:30                     ` Brian Gough
2006-03-29 23:04       ` gslclapack Gerard Jungman
2006-03-31 23:19     ` gslclapack Brian Gough
2006-04-03 19:47       ` gslclapack Gerard Jungman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).