public inbox for gsl-discuss@sourceware.org
* Re: gslclapack
@ 2006-03-29 23:14 Gerard Jungman
  2006-03-30 13:21 ` gslclapack Robert G. Brown
  0 siblings, 1 reply; 16+ messages in thread
From: Gerard Jungman @ 2006-03-29 23:14 UTC (permalink / raw)
  To: gsl-discuss


One more small point (thanks to Jim Amundson for pointing this out):
ATLAS not only has blas functions, but also has a few lapack routines.
I don't know what their intention is/was in this direction, but this
may add something to the stew.

-- 
Gerard Jungman <jungman@lanl.gov>
Los Alamos National Laboratory


* Re: gslclapack
  2006-03-29 23:14 gslclapack Gerard Jungman
@ 2006-03-30 13:21 ` Robert G. Brown
  0 siblings, 0 replies; 16+ messages in thread
From: Robert G. Brown @ 2006-03-30 13:21 UTC (permalink / raw)
  To: Gerard Jungman; +Cc: gsl-discuss

On Wed, 29 Mar 2006, Gerard Jungman wrote:

>
> One more small point (thanks to Jim Amundson for pointing this out):
> ATLAS not only has blas functions, but also has a few lapack routines.
> I don't know what their intention is/was in this direction, but this
> may add something to the stew.

If we're spicing up the stew, this might be a good time to reconsider
the way matrices are handled in GSL in general.  Some time ago there was
a discussion of multidimensional (high-rank) tensors -- 3-9 dimensions,
basically, of use in e.g. relativistic theories, lattice gauge theories,
a whole bunch of physics -- which in C are typically dynamically
allocated via looped pointers and offsets.  There is also a strong
motivation to allocate the actual physical memory as a single contiguous
vector (perhaps inside a struct) so it can be passed easily to e.g.
vector-based ODE solvers.  Since such multidimensional matrices may
themselves be matrices of structs as easily as matrices of e.g. ints or
doubles, ideally the constructors and destructors would need to do their
work with void typing and permit a cast to determine what exactly is in
the matrix and how large it is.

At that time I wrote such a matrix allocator in something like the GSL
style -- the style itself had to be slightly modified as the basic
objects in use pretty much presupposed 1-2d simple matrices.  It could
manage up to (IIRC, I haven't looked at it or used it for a couple of
years now) 9th rank tensors or thereabouts with more or less arbitrary
index ranges on each slot.  If there is any interest in this I'd be
happy to revive it and/or contribute it for consideration.

The point being that it sounds like doing LAPACK "right" in GSL may
involve messing some with the GSL matrix definition anyway, which in
turn may involve a "major" number bump -- routines that use matrices may
need to be rewritten to accommodate this.  If this happens, it would be a
really good time to rethink tensors altogether and implement SOME sort
of high-rank tensor in the GSL, ideally making matrices 2d objects in a
smooth hierarchy.

The other thing that would need to be considered at that time, BTW, is
that whatever matrix schema is used, it needs to minimize redirection
and "object"-like access in favor of efficiency.  One reason that people
still use Fortran for a lot of linear-algebra-based arithmetic is that
its fairly rigorous matrix typing permits compiler-level optimizations
that object-like implementations in C and/or C++ do not.  This is the
reason that I altered the way matrices were packed in my code -- the
actual storage is always a single contiguous vector *data whose offsets
are repacked into the usual **...*matrix objects.  One lets the compiler
manage the offset arithmetic during access in matrix[i][j]...[n] form,
but one HAS the data in data[i] form under a single pointer, and a
cast permits either view to manage the right offsets.  I'm not certain
that this is the most efficient way of doing it, BTW -- it is just one
that makes it very easy to code and to visualize e.g. blocking and
stride while one does so.
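
For concreteness, a minimal sketch of that packing for a rank-3 tensor of
doubles might look something like the following (the struct and function
names are made up for illustration, and the index ranges are simplified
to 0-based; this is not GSL code):

  #include <stdlib.h>

  typedef struct {
    size_t n1, n2, n3;   /* extent of each index */
    double *data;        /* the single contiguous vector */
    double ***m;         /* repacked pointers: m[i][j][k] */
  } tensor3;

  tensor3 *tensor3_alloc(size_t n1, size_t n2, size_t n3)
  {
    tensor3 *t = malloc(sizeof *t);
    if (t == NULL) return NULL;

    t->n1 = n1; t->n2 = n2; t->n3 = n3;
    t->data = malloc(n1 * n2 * n3 * sizeof(double));
    t->m    = malloc(n1 * sizeof(double **));
    /* one block of row pointers shared by all (i,j) pairs */
    double **rows = malloc(n1 * n2 * sizeof(double *));
    if (t->data == NULL || t->m == NULL || rows == NULL) {
      free(rows); free(t->m); free(t->data); free(t);
      return NULL;
    }

    for (size_t i = 0; i < n1; i++) {
      t->m[i] = rows + i * n2;
      for (size_t j = 0; j < n2; j++)
        t->m[i][j] = t->data + (i * n2 + j) * n3;   /* offsets into *data */
    }
    return t;
  }

  void tensor3_free(tensor3 *t)
  {
    if (t == NULL) return;
    free(t->m[0]);   /* the shared block of row pointers */
    free(t->m);
    free(t->data);
    free(t);
  }

After this, t->m[i][j][k] and t->data[(i*n2 + j)*n3 + k] refer to the same
double, so the whole thing can be handed to a vector-based solver through
t->data while ordinary indexed access goes through t->m.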

This is the kind of thing that it would be good for a Real Computer
Scientist (and C God) to look at and pronounce upon.  I'm at best a
possibly gifted amateur here for all of the C code I've written over the
years -- I haven't focused on operational efficiency the way the
compiler guys have.  Maybe Greg Lindahl or somebody else from PathScale
or the gcc project could give us guidance on the most efficient ways of
implementing this, or we could even work hand in hand with the gcc folks
to get "blessed" optimization of whatever way tensor forms are
implemented in pointers, so that we could eventually make the GSL as
efficient as a Fortran library in this regard, at least with gcc.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu



* Re: gslclapack
  2006-03-31 23:19     ` gslclapack Brian Gough
@ 2006-04-03 19:47       ` Gerard Jungman
  0 siblings, 0 replies; 16+ messages in thread
From: Gerard Jungman @ 2006-04-03 19:47 UTC (permalink / raw)
  To: gsl-discuss

On Sat, 2006-04-01 at 00:19 +0100, Brian Gough wrote:
> 
> Until LAPACK has transpose arguments to make a 'LAPACK trick' possible
> it's a lost cause, and it is probably easier for people to call LAPACK
> directly on a case-by-case basis.

I am hoping that the new work will address this problem. Check the
message I sent earlier with the link to the 2004 Dongarra et al.
proposal. I'm sure you've seen this; I think we talked about it.

We need to figure out what they are doing, and act accordingly.
They may be solving all these problems.

-- 
Gerard Jungman <jungman@lanl.gov>
Los Alamos National Laboratory


* Re: gslclapack
  2006-03-29  0:33   ` gslclapack Gerard Jungman
  2006-03-29 17:45     ` gslclapack Linas Vepstas
@ 2006-03-31 23:19     ` Brian Gough
  2006-04-03 19:47       ` gslclapack Gerard Jungman
  1 sibling, 1 reply; 16+ messages in thread
From: Brian Gough @ 2006-03-31 23:19 UTC (permalink / raw)
  To: gsl-discuss

Gerard Jungman writes:
 >   - We discussed creating a lapack interface, similar to the
 >     blas interface. The basic limitation was lack of manpower
 >     and some statements about the difficulty of dealing with
 >     workspace allocation/management, which I did not understand
 >     at the time, and still don't understand.

The basic problem with a LAPACK interface is that there is no clean way
to do it.  I've tried this several times.  As we remember, with BLAS you
have the 'CBLAS trick' to get the row-major versions of Fortran BLAS
by using the BLAS transpose argument, which is conveniently provided
in all necessary functions.
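
As an aside, a rough sketch of what that trick amounts to (this is not
taken from any actual CBLAS implementation, and the underscore-suffixed
dgemm_ symbol is the usual but compiler-dependent f77 naming convention):
a row-major C = A*B can be had from a column-major Fortran DGEMM by
computing C^T = B^T * A^T, i.e. by swapping the operands and the
row/column dimensions; for routines with a single matrix operand the
transpose argument itself does the job.

  /* C (mr x nr), A (mr x kr), B (kr x nr), all row-major; C = A*B */
  extern void dgemm_(const char *transa, const char *transb,
                     const int *m, const int *n, const int *k,
                     const double *alpha, const double *a, const int *lda,
                     const double *b, const int *ldb,
                     const double *beta, double *c, const int *ldc);

  static void dgemm_rowmajor(int mr, int nr, int kr,
                             const double *A, const double *B, double *C)
  {
    const double one = 1.0, zero = 0.0;
    /* viewed column-major, the same memory holds A^T, B^T and C^T,
       so ask the Fortran routine for C^T = B^T * A^T */
    dgemm_("N", "N", &nr, &mr, &kr, &one, B, &nr, A, &kr, &zero, C, &nr);
  }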

The LAPACK functions typically don't have a transpose argument and are
hard-coded for column-major.  This means either you have to use
column-major in C (which is inconvenient if you have other C-style
matrices around) or you need to provide transposed matrices, either by
logically solving the transposed system or physically transposing them
before each call and then transposing them back (which is what Numpy
does).

If the matrix has M=N or N=TDA then you can do the transpose in place,
so you don't need extra memory.  But in the general case (M!=N, N!=TDA)
you have to allocate a copy of the matrix.  Obviously for large matrices
this is wasteful of memory, but these are exactly the sorts of systems
you would want to solve.

Already you are looking at 3 possible types of wrapper function for
each LAPACK function to handle the different cases.  But it's more
complicated than that because for functions with multiple arguments
you might have different cases for each argument.  

So either there has to be some radical simplifying assumption like
"always allocate memory and transpose" or "only handle matrices with
N=TDA" or "make the user pass everything as column-major". But people
will surely complain whichever is chosen.  
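
For concreteness, the "always allocate memory and transpose" option for a
single routine might look roughly like the sketch below.  The wrapper name
is hypothetical, the underscore-suffixed dgetrf_ symbol follows the common
but compiler-dependent f77 convention, and error handling is minimal -- it
is only meant to show the copy-in/copy-out shape of such a wrapper, not to
be the actual interface.

  #include <stdlib.h>
  #include <gsl/gsl_matrix.h>
  #include <gsl/gsl_errno.h>

  /* Fortran LAPACK LU factorization, column-major */
  extern void dgetrf_(const int *m, const int *n, double *a,
                      const int *lda, int *ipiv, int *info);

  /* hypothetical wrapper; ipiv must have at least min(m,n) entries */
  int lapack_lu_rowmajor(gsl_matrix *A, int *ipiv)
  {
    const int m = (int) A->size1;
    const int n = (int) A->size2;
    int info = 0;
    double *tmp = malloc((size_t) m * n * sizeof(double));
    if (tmp == NULL) return GSL_ENOMEM;

    /* pack the row-major gsl_matrix into column-major scratch storage */
    for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
        tmp[j * m + i] = gsl_matrix_get(A, i, j);

    dgetrf_(&m, &n, tmp, &m, ipiv, &info);

    /* unpack the factors back into the row-major matrix */
    for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
        gsl_matrix_set(A, i, j, tmp[j * m + i]);

    free(tmp);
    return (info == 0) ? GSL_SUCCESS : GSL_EFAILED;
  }

Every matrix argument of every routine needs some variant of that copy
dance (or one of the other simplifying assumptions), which is where the
combinatorial mess comes from.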

Until LAPACK has transpose arguments to make a 'LAPACK trick' possible
it's a lost cause, and it is probably easier for people to call LAPACK
directly on a case-by-case basis.

-- 
Brian Gough


* Re: gslclapack
  2006-03-29 17:45     ` gslclapack Linas Vepstas
  2006-03-29 18:04       ` gslclapack J.J.Green
@ 2006-03-29 23:04       ` Gerard Jungman
  1 sibling, 0 replies; 16+ messages in thread
From: Gerard Jungman @ 2006-03-29 23:04 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: gsl-discuss

On Wed, 2006-03-29 at 11:44 -0600, Linas Vepstas wrote:
> I found this conversation a tad too abstract.
> 
> On Tue, Mar 28, 2006 at 05:30:07PM -0700, Gerard Jungman wrote:
> >     I am hoping that the clapack standard
> >     says something about this situation; after all, it
> >     ought to be designed to work well with cblas.
> 
> I don't understand. A Google search for clapack reveals things like
> http://www.netlib.org/clapack/
> 
> which seems no better than calling the Fortran libs directly
> from C code. And the problem with this is, of course, that

Sorry. I was not precise enough. There is a thing called "clapack",
which is disgusting (I think it is basically f2c run on the LAPACK
source, followed by twiddling). What I mean by a "C lapack" is
whatever the new LAPACK development cycle is going to produce.
Currently, Dongarra et al. are supposed to be working on improved
versions of LAPACK and ScaLAPACK. This new work includes standardized
wrappers for a set of languages, like C, Python, f95, etc.
They propose to implement in f77, as before, which is understandable.

The (funded) proposal is available at
http://www.cs.berkeley.edu/~demmel/Sca-LAPACK-Proposal.pdf

The announcement related to this can be found by searching for
"New Release of LAPACK and ScaLAPACK planned" on Google Groups.

They state explicitly that they want to use "better interfaces
using features of up-to-date programming languages" to "hide
some of the complex distributed data structures and other features".
These are exactly the kinds of interface improvements we want,
so it would be great if they could sort it all out and
produce something attractive.


> > If somebody dropped a working lapack interface on the table,
> > I would certainly want to put it in GSL. It's a clear win.
> 
> I presume you mean "a C wrapper that solves the strange
> calling convention, documentation and workspace problems"?

Exactly, as per the above.

Linas, thanks for the chance to clarify what I meant!

-- 
Gerard Jungman <jungman@lanl.gov>
Los Alamos National Laboratory


* Re: gslclapack
  2006-03-29 18:04       ` gslclapack J.J.Green
@ 2006-03-29 18:51         ` James Bergstra
  0 siblings, 0 replies; 16+ messages in thread
From: James Bergstra @ 2006-03-29 18:51 UTC (permalink / raw)
  To: gsl-discuss

On Wed, Mar 29, 2006 at 07:04:39PM +0100, J.J.Green wrote:
> I think the problem with calling Fortran from C is that
> it is compiler, OS and (I think) architecture dependent.
> CLAPACK solves the portability problem, but as you have
> noticed, not the arcane interface, which puts most people
> off using it.

Exactly the motivation for a GSL wrapper (or extension)!  A
GSL module would serve both as a gentle introduction to
using LAPACK and as a hook on which to hang normal
documentation.

It seems to me that the largest barrier to a LAPACK wrapper
is that gsl_matrix is an implicitly row-major structure, so
that in general, arguments must be transposed out of place
before a LAPACK call.  Although this does not change the
asymptotic time or space requirements, the overhead is more
noticeable in the case of large numbers of calls with small
arguments.

Anyway, it seems like the consensus is that LAPACK is
annoying to use but good in theory, once it's working.  If
someone (e.g. me) writes an extension that makes at least
part of LAPACK less annoying to use in practice, should
they share it with the list?


James

PS.  My extension would probably introduce a column-major
version of gsl_matrix.
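
For what it's worth, such a column-major counterpart might look roughly
like this -- nothing of the sort exists in GSL today and the names are
made up; the fields simply mirror gsl_matrix with the roles of rows and
columns swapped, so that the data could be passed straight to a
column-major LAPACK:

  #include <stddef.h>
  #include <gsl/gsl_block.h>

  typedef struct {
    size_t size1;    /* number of rows */
    size_t size2;    /* number of columns */
    size_t tda;      /* stride between columns, >= size1 */
    double *data;    /* column-major: element (i,j) is data[j*tda + i] */
    gsl_block *block;
    int owner;
  } gsl_matrix_cm;

  static double gsl_matrix_cm_get(const gsl_matrix_cm *m, size_t i, size_t j)
  {
    return m->data[j * m->tda + i];   /* note: j, not i, picks the column */
  }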

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj


* Re: gslclapack
  2006-03-29 17:45     ` gslclapack Linas Vepstas
@ 2006-03-29 18:04       ` J.J.Green
  2006-03-29 18:51         ` gslclapack James Bergstra
  2006-03-29 23:04       ` gslclapack Gerard Jungman
  1 sibling, 1 reply; 16+ messages in thread
From: J.J.Green @ 2006-03-29 18:04 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Gerard Jungman, gsl-discuss

Hi

> which seems no better than calling the Fortran libs directly
> from C code. And the problem with this is, of course, that
   [snip]
> These seem to be the major problems, right?

I think the problem with calling Fortran from C is that
it is compiler, OS and (I think) architecture dependent.
CLAPACK solves the portability problem, but as you have
noticed, not the arcane interface, which puts most people
off using it.

Cheers

Jim
-- 
J.J. Green, Dept. Applied Mathematics, Hicks Bld.,
University of Sheffield, UK.   +44 (0114) 222 3742
http://pdfb.wiredworkplace.net/pub/jjg



* Re: gslclapack
  2006-03-29  0:33   ` gslclapack Gerard Jungman
@ 2006-03-29 17:45     ` Linas Vepstas
  2006-03-29 18:04       ` gslclapack J.J.Green
  2006-03-29 23:04       ` gslclapack Gerard Jungman
  2006-03-31 23:19     ` gslclapack Brian Gough
  1 sibling, 2 replies; 16+ messages in thread
From: Linas Vepstas @ 2006-03-29 17:45 UTC (permalink / raw)
  To: Gerard Jungman; +Cc: gsl-discuss


I found this conversation a tad too abstract.

On Tue, Mar 28, 2006 at 05:30:07PM -0700, Gerard Jungman wrote:
>     I am hoping that the clapack standard
>     says something about this situation; after all, it
>     ought to be designed to work well with cblas.

I don't understand. A Google search for clapack reveals things like
http://www.netlib.org/clapack/

which seems no better than calling the Fortran libs directly
from C code. And the problem with this is, of course, that

-- the Fortran subroutine names are a bit too short and cryptic
-- the Fortran calling convention is rather bizarre for C
   (see e.g. the ick at http://www.netlib.org/clapack/clapack.h)
-- the only decent documentation for the actual subroutine
   arguments seems to be in the Fortran code, making it
   hard to read. :-(

These seem to be the major problems, right?

>   - We discussed creating a lapack interface, similar to the
>     blas interface. The basic limitation was lack of manpower
>     and some statements about the difficulty of dealing with
 >     workspace allocation/management, which I did not understand
>     at the time, and still don't understand.

Right. This is part of what makes the existing Fortran interfaces
nasty to use.

> If somebody dropped a working lapack interface on the table,
> I would certainly want to put it in GSL. It's a clear win.

I presume you mean "a C wrapper that solves the strange
calling convention, documentation and workspace problems"?

--linas


* Re: gslclapack
  2006-03-28 13:43     ` gslclapack Robert G. Brown
  2006-03-28 23:38       ` gslclapack C J Kenneth Tan -- OptimaNumerics
@ 2006-03-29  1:57       ` James Bergstra
  1 sibling, 0 replies; 16+ messages in thread
From: James Bergstra @ 2006-03-29  1:57 UTC (permalink / raw)
  To: gsl-discuss

> >>> On one hand, I'm jealous that lapack has a fancy routine, and react 
> >>"let's
> >>> implement that for gsl too" and on the other hand, I think "why is gsl
> >>> re-implementing (what I believe is) a very standard library?"
> >>
> >>LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
> >>a dependency on lapack would be a pain.
> >
> >How do you find LAPACK being a pain?  The public version of LAPACK has
> >been well tested and the design has been well thought out.  It has a
> >very comprehensive test suite also.
> 
> Is this an either/or proposition?  Having its own lapack implementation
> just eliminates a dependency which I agree is desirable, and since the
> system-provided LAPACK often sucks as far as efficiency goes it also
> means that one CAN gradually start to tune up gsllapack. If the calls
> are the same, though, presumably one could link into an alternative
> library via suitable compile/link flags.  I thought this was the way
> things were being done already, actually, so one could e.g. use ATLAS
> BLAS:

My hope was to encourage interchangeable implementations.

But it seems LAPACK routines cannot be used as drop-in replacements to
implement the linalg_xxxx ones, on account of the row-major versus
column-major business.

Is this true?

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj


* Re: gslclapack
  2006-03-28 23:38       ` gslclapack C J Kenneth Tan -- OptimaNumerics
@ 2006-03-29  1:20         ` Robert G. Brown
  0 siblings, 0 replies; 16+ messages in thread
From: Robert G. Brown @ 2006-03-29  1:20 UTC (permalink / raw)
  To: C J Kenneth Tan -- OptimaNumerics
  Cc: Brian Gough, James Bergstra, gsl-discuss

On Tue, 28 Mar 2006, C J Kenneth Tan -- OptimaNumerics wrote:

> As I have mentioned above, the BLAS layer is kept the same in each of the
> benchmarks, therefore the benchmarks show the differences in LAPACK
> performance.

Well, clearly that isn't NECESSARILY true.  Especially not if the BLAS
layer is doing 80% or 90% of the actual CPU-driven work for a given
call, as it is for many calls -- which is how one obtains the
aforementioned factor of 2-3 speedup that many people have reported for
LAPACK built on an ATLAS-tuned BLAS.  If BLAS variation alone can drop
timings to 30% of their worst-case values, then reducing the non-BLAS
fraction of the timing to ZERO would still leave you at roughly that 30%
figure -- only a marginal further gain -- if you understand what I'm
saying.

If you disagree with this, well it only makes the point more strongly --
it means that yes, LAPACK structure and BLAS structure ARE strongly
coupled so that testing LAPACK variations "holding BLAS constant" is
pointless, if you're moving the LAPACK in question into tune with the
BLAS in question and both have plenty of room for improvement.

Not that I'm disagreeing with the idea that different LAPACKs have
different performance or that yours is probably better -- only with the
idea that you can separate out LAPACK-specific performance issues from
the compiler, BLAS layer, linker and application/benchmark program by
"just" keeping some or all of them the same while varying the LAPACK
code.  Programming is way too nonlinear and coupled for that to be
valid.

I'd also be very suspicious of 1000% effects ANYWHERE in real-world
programming (and yes, very especially in LAPACK with a common BLAS
layer:-) -- that smacks of superlinear speedup or something that is true
but useless for 90% of all programmers -- or some truly terrible code,
ALGORITHMICALLY terrible and not just terrible from the point of view of
hand tuning some loops.

Not that I don't wish it were true and possible...I just never have
found it to be so except when fixing really boneheaded code... nothing
commercial grade, nothing competent programmer grade.  Although sure,
some LAPACKs may be very poor indeed here and effectively flog the cache
with their strides, just as non-ATLAS BLAS can easily do.

>> Or if you like, a longer-term question that is quite worth bringing up
>> is to what extent it is desirable to introduce architecture-specific
>> tuning into actual GSL code, period.
>
> Is this considered scalable from a development perspective?

I'd guess not, mostly.  But it's really up to the developers, right?

>> On the one hand, it makes the code ugly and extremely difficult to
>> maintain and offers small marginal performance gains in most places BUT
>> linear algebra.  It tends to depend heavily on compiler as well -- if
>> you write code optimized for e.g.  presumed SSE1 or SSE2 instructions
>> your compiler has to support them, not just your CPU.  It makes the code
>> in a sense less portable.  It is "expensive" in human time, and developer
>> time in the GSL is probably largely donated and a precious resource.
>> I'd rather give up 20-30% in "potential" performance and use the GSL for
>> free, personally, blessing the developers and their contribution to
>> world civilization as I do so. (Thanks, guys!:-)
>>
>> On the other hand, for some applications tuning CAN be a big, big
>> performance booster -- 2-3x -- and hence very valuable.  I'm working
>> (depending on whether and how much support I get) on a portable/open way
>> of writing at least moderately autotuning HPC code.  Something that will
>> get "most" of the benefit of a full ATLAS-like autotuning build without
>> architecture-specific instrumentation.  Don't hold your breath, but in a
>> year or three maybe this problem can be at least attacked without
>> playing the library juggling game.
>
> Is this approach scalable when it comes to highly complex scientific
> code, as opposed to the comparatively more straightforward BLAS-level
> code?

You're addressing research-level computer science questions, really.
From what I've seen, the GSL is already pretty decent in its algorithm
implementations as far as efficiency is concerned.  It isn't necessarily
"optimized", but it is good, competent code that won't run badly on
pretty much anything.  The ALGORITHMS are good, and are well
implemented.

To do better, an algorithm needs to be architecture aware to an extent
that few programs are, or really can be.  It needs portable "easy
access" to certain critical performance-determining metrics at runtime
initialization, so it can set e.g. stride sizes and perhaps change
algorithms depending on the measured details of the CPU/memory/IO bus
collective -- viewed as a set of sizes, latencies, bandwidths and so on.
ATLAS does what amounts to an empirical search of a multiparameter space
at build time, but in principle that space could be pre-searched --
mapped out in general terms -- and a smaller set of predictors
determined for setting parameters "close to" their optima.  You might
not get 100% of the possible speedup, but I'm guessing that you can get
80-90%, and from a conversation I had with Jack Dongarra it sounds like
this is what at least one researcher addressing the question is finding
empirically.

Note well that ATLAS/BLAS is close to the limit of algorithmic
complexity here -- most algorithms one might wish to tune are likely to
be simpler, many of them setting a single parameter for stride or the
blocking of a vector based on e.g. a measured or automatically
determined knowledge of the L1/L2 cache sizes and speeds relative to
main memory.  An example of one way GSL might be retuned for
optimization using information of this sort is in random number
generation.

Many RNGs use a very simple algorithm to actually generate each
successive number -- some additions, multiplications, moduli, done.
Many RNG users consume LOTS AND LOTS of random numbers in a computation.
However, the RNG interface in the GSL tends to generate them one at a
time with a subroutine/library call layer in between.  It might be
possible, for example, for the library to at least have a parameter that
one could set at initialization time that tells it to create a VECTOR
BUFFER of rands of some suitable length -- say a page, or several pages
-- and to go ahead and fill it on the initial call using SSE-optimized
vector code.  Then simply return each successive number on demand and
increment a static pointer in the vector until it runs out and then do
it again.
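
A rough sketch of that idea, wrapping the existing interface rather than
changing it (the struct, the buffer length, and the plain scalar fill loop
are all illustrative -- the scalar loop is exactly the part a vectorized,
SSE-capable fill routine would replace):

  #include <stddef.h>
  #include <gsl/gsl_rng.h>

  #define RNG_BUF_LEN 4096           /* a few pages of doubles */

  typedef struct {
    gsl_rng *r;
    double buf[RNG_BUF_LEN];
    size_t next;                     /* index of next unused entry */
  } buffered_rng;

  static void buffered_rng_init(buffered_rng *b, const gsl_rng_type *T)
  {
    b->r = gsl_rng_alloc(T);
    b->next = RNG_BUF_LEN;           /* force a refill on first use */
  }

  static void buffered_rng_fill(buffered_rng *b)
  {
    /* stand-in for a vectorized fill of the whole buffer */
    for (size_t i = 0; i < RNG_BUF_LEN; i++)
      b->buf[i] = gsl_rng_uniform(b->r);
    b->next = 0;
  }

  static double buffered_rng_uniform(buffered_rng *b)
  {
    if (b->next == RNG_BUF_LEN)
      buffered_rng_fill(b);
    return b->buf[b->next++];
  }

As written this only amortizes the per-call overhead; the real win would
come from replacing the fill loop with a generator that produces the whole
block at once.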

This would likely never run more SLOWLY than single-number-per-call
code, per number returned, although it makes the timing of those calls
"odd".  For an MC person like myself, I couldn't care less as long as the
average time per call goes down -- I use numbers like 10^18 of them in a
long-running computation and it's my life that's a-wasting in the
meantime.

What's needed to support this sort of thing?  First, some evidence that
there is marginal benefit worth the effort in coding.  Second, you
probably wouldn't do it anywhere but the "bread and butter" RNGs -- the
ones that are the best, fastest, pass diehard, are the default or
equally good alternatives.  Why speed up any of the RNGs that are
obviously flawed and there only so people can compute and compare the
flaws?  Third, some idea that the optimization you can do is PORTABLE and
will actually improve performance on e.g. SPARC, i386, X86_64, etc., with
at most small parametric tunings that can be done at runtime based on
the aforementioned metrics.

That might not be so bad or impossible to maintain, and if there were
good software support for obtaining the right metrics, they might prove
useful to ordinary numerical programmers in normal code instances --
LOTS of people do loops filled with arithmetic, few people have the
tools or energy or knowledge to be able to determine how to block out
those loops to run fast.

>> In the meantime, hey, the GSL is open source, full GPL, viral.  That
>> means that as long as the LAPACK API is preserved -- or a wrapper of the
>> standard lapack calls provided -- you can ALWAYS hack and link your own
>> LAPACK in.  The effort in any event is WAY less than the effort required
>> to actually build an architecture-dependent LAPACK in the first place,
>> and you can freely market a "tuned" version of the GSL as long as you
>> provide its sources, and it seems to me that in the case of lapack this
>> is at MOST developing a wrapper.  I've done similar things already for
>> adding e.g. rng's to the GSL -- it isn't difficult.  Or am I missing
>> something?
>
> Yes, we agree on this.  This is what I think is a good option.  It is
> crucial that the LAPACK API is preserved.

Interesting discussion:-)

A lot of it is similar to parallel performance discussions on the
beowulf list.  There are easy ways to parallelize some things, but if
you really want to get serious you may need to e.g. get Ian Foster's
book (or one of the other good parallel programming books, e.g. Amalfi)
and learn about parallel algorithms.  Then you find that yeah, your
"obvious" algorithm gave you terrible speedup for reasons that in
retrospect are equally obvious.

Performance tuning in general is SO nonlinear and SO tightly coupled to
architecture, compiler choice, library set.  Then there is algorithmic
robustness to consider (don't want to get fast, incorrect answers as
legend has it one can easily do with Cell processors if you aren't
careful).  Most real numerical code I personally have written has been a
compromise, with efficiency subjugated to portability, long-term
maintainability, and not getting wrong answers to the extent I can
control it with numerical error to contend with.  I think that these are
sensible design criteria for the GSL as well, where of course one would
hope to use efficient algorithms wherever possible within these
constraints.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu



* Re: gslclapack
  2006-03-28 11:48 ` gslclapack Brian Gough
  2006-03-28 12:14   ` gslclapack C J Kenneth Tan -- OptimaNumerics
@ 2006-03-29  0:33   ` Gerard Jungman
  2006-03-29 17:45     ` gslclapack Linas Vepstas
  2006-03-31 23:19     ` gslclapack Brian Gough
  1 sibling, 2 replies; 16+ messages in thread
From: Gerard Jungman @ 2006-03-29  0:33 UTC (permalink / raw)
  To: gsl-discuss

On Tue, 2006-03-28 at 12:48 +0100, Brian Gough wrote:
> James Bergstra writes:
>  > On one hand, I'm jealous that lapack has a fancy routine, and react "let's
>  > implement that for gsl too" and on the other hand, I think "why is gsl
>  > re-implementing (what I believe is) a very standard library?"
> 
> LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
> a dependency on lapack would be a pain.

This is too cheap an answer. It also smells too much like the
old "XYZ package is too large and complicated, we just need
to build our own" syndrome.

And there is absolutely no reason why "the right way"
should be in conflict with "easy to use". For example,
I think the way GSL does the cblas interface, with a
link-time resolution, is very easy to use, and it
seems like a pretty good approximation to the right way.
(Well, it was my idea, so of course I think so...).


There have been several comments on the original message, so I'll
try not to labor over the same points. Here is what I think is obvious:

  - gsl_linalg is basically a "mistake"; it satisfies the immediate
    need of a certain class of users, but in so doing it hardwires 
    an unattractive and inflexible scheme in place for everyone; even
    if you don't use it, you pay for it because it costs development
    time and takes up intellectual real estate. And I wrote a lot of
    it, so I think my opinion counts for something in this regard.
    It was my mistake.

  - GSL should provide an interface to lapack; it need not supersede
    the gsl_linalg implementation, but can coexist with it. It can
    be done in such a way that users can choose an appropriate
    implementation at link time, in the same way that the blas
    interface works. The only issue here is that we need to work
    out how to satisfy its blas dependencies. As long as it
    works in terms of cblas conformant calls, the dependencies
    will be resolved correctly at link time when the desired
    cblas implementation is picked up. If it works some other
    way, like deferring to an underlying fortran lapack   
    implementation which uses fortran-style name lookup
    for blas functions, then that would make things
    less coherent and this would need to be fixed up
    in some way. I am hoping that the clapack standard
    says something about this situation; after all, it
    ought to be designed to work well with cblas.

  - Lapack may be difficult to build. I wouldn't know, since I
    never had to build it. I just install the RPM. As someone
    pointed out, the efficiency of lapack is essentially inherited
    from the underlying blas, so this out-of-the-can solution
    should be ok for nearly everyone. Note that I do not consider
    this to be an argument against having a GSL interface, since
    such an interface should take care of the GSL house-keeping
    chores, marshalling the arguments in the correct way for the
    more low-level interface, and dealing with error handling.
    It would be a thin interface, but still necessary.

  - It is _always_ better to reuse a general and mature solution
    like lapack than to create a half-baked substitute, no matter
    how good your intentions or how limited the apparent scope
    of the need. Even if there is no "efficiency" gained in using
    a real lapack implementation, many things are gained. These
    include complete coverage of methods, a large suite of tests,
    a robustness that only comes from the real thing, because of
    the large existing user base, and the elimination of maintenance
    for a large chunk of specialty code.


Some historical data:

  - We discussed creating a lapack interface, similar to the
    blas interface. The basic limitation was lack of manpower
    and some statements about the difficulty of dealing with
    workspace allocation/management, which I did not understand
    at the time, and still don't understand.

  - There was some concern about evolving standards. These
    decisions were made for GSL around 1998, when even cblas
    was just a draft concept. It was not clear what was in
    the future for lapack interfaces, so we decided to
    sidestep the problem.

  - There was some concern about licensing. This is mentioned
    in the TODO file, but I do not know what the question is.
    I doubt that this can be a serious problem.


If somebody dropped a working lapack interface on the table,
I would certainly want to put it in GSL. It's a clear win.

The only hard part is that we do need a reference implementation
of some sort to ship with GSL, like the gslcblas component which
ships, so that "casual" users can hit the ground running. Such
an implementation can coexist with gsl_linalg, and gsl_linalg
can be eventually deprecated.

This whole gsl_linalg thing is kind of embarrassing. People
expect to see a coherent picture for linear algebra, at the
very least a complete and functional lapack interface.
Instead, the GSL situation just smells a little fishy,
and there is a strong sociological argument for
cleaning up the mess, in order not to scare away
potential serious users.

-- 
Gerard Jungman <jungman@lanl.gov>
Los Alamos National Laboratory


* Re: gslclapack
  2006-03-28 13:43     ` gslclapack Robert G. Brown
@ 2006-03-28 23:38       ` C J Kenneth Tan -- OptimaNumerics
  2006-03-29  1:20         ` gslclapack Robert G. Brown
  2006-03-29  1:57       ` gslclapack James Bergstra
  1 sibling, 1 reply; 16+ messages in thread
From: C J Kenneth Tan -- OptimaNumerics @ 2006-03-28 23:38 UTC (permalink / raw)
  To: Robert G. Brown; +Cc: Brian Gough, James Bergstra, gsl-discuss

Robert,

On 2006-03-28 08:44 -0500 Robert G. Brown (rgb@phy.duke.edu) wrote:

> If the calls are the same, though, presumably one could link into an
> alternative library via suitable compile/link flags.

Yes, the API needs to be kept the same.

> So I don't see why there is an issue here.  LAPACK efficiency was, I
> thought, "largely" inherited from BLAS although I'm sure there are
> additional efficiencies one can realize if one works for them.  

Depends on the definition of "largely".  In our LAPACK benchmarks, we
use the same BLAS for all of the benchmarks, so that they show the
differences in the LAPACK layer.  We found the differences could be
very high (as much as thousands of percent).

> However, I don't see that this is an either/or proposition, and looking at:
> 
> > http://www.OptimaNumerics.com/docs/hpc-asia05/hpc-asia05.pdf .
> 
> ... the important question isn't benchmarks against "competitor", it
> is benchmarks against the specific case of GSL+ATLAS tuned BLAS and the
> marginal advantages there, using the same compiler and support library
> otherwise.  I'm more than curious to see what the marginal advantages
> are of tuned LAPACK >>beyond<< those provided by using a good ATLAS
> tuned BLAS and corresponding LAPACK.

As I have mentioned above, the BLAS layer is kept the same in each of the
benchmarks, therefore the benchmarks show the differences in LAPACK
performance. 

> Or if you like, a longer-term question that is quite worth bringing up
> is to what extent it is desirable to introduce architecture-specific
> tuning into actual GSL code, period.

Is this considered scalable from a development perspective?

> On the one hand, it makes the code ugly and extremely difficult to
> maintain and offers small marginal performance gains in most places BUT
> linear algebra.  It tends to depend heavily on compiler as well -- if
> you write code optimized for e.g.  presumed SSE1 or SSE2 instructions
> your compiler has to support them, not just your CPU.  It makes the code
> in a sense less portable.  It is "expensive" in human time, and developer
> time in the GSL is probably largely donated and a precious resource.
> I'd rather give up 20-30% in "potential" performance and use the GSL for
> free, personally, blessing the developers and their contribution to
> world civilization as I do so. (Thanks, guys!:-)
> 
> On the other hand, for some applications tuning CAN be a big, big
> performance booster -- 2-3x -- and hence very valuable.  I'm working
> (depending on whether and how much support I get) on a portable/open way
> of writing at least moderately autotuning HPC code.  Something that will
> get "most" of the benefit of a full ATLAS-like autotuning build without
> architecture-specific instrumentation.  Don't hold your breath, but in a
> year or three maybe this problem can be at least attacked without
> playing the library juggling game.

Is this approach scalable when it comes to highly complex scientific
code, as opposed to the comparatively more straightforward BLAS-level
code? 

> In the meantime, hey, the GSL is open source, full GPL, viral.  That
> means that as long as the LAPACK API is preserved -- or a wrapper of the
> standard lapack calls provided -- you can ALWAYS hack and link your own
> LAPACK in.  The effort in any event is WAY less than the effort required
> to actually build an architecture-dependent LAPACK in the first place,
> and you can freely market a "tuned" version of the GSL as long as you
> provide its sources, and it seems to me that in the case of lapack this
> is at MOST developing a wrapper.  I've done similar things already for
> adding e.g. rng's to the GSL -- it isn't difficult.  Or am I missing
> something?

Yes, we agree on this.  This is what I think is a good option.  It is
crucial that the LAPACK API is preserved.


Kenneth Tan
-----------------------------------------------------------------------
C J Kenneth Tan, PhD
OptimaNumerics Ltd                    Telephone: +44 798 941 7838
E-mail: cjtan@OptimaNumerics.com      Telephone: +44 207 099 4428
Web: http://www.OptimaNumerics.com    Facsimile: +44 207 100 4572
-----------------------------------------------------------------------


* Re: gslclapack
  2006-03-28 12:14   ` gslclapack C J Kenneth Tan -- OptimaNumerics
@ 2006-03-28 13:43     ` Robert G. Brown
  2006-03-28 23:38       ` gslclapack C J Kenneth Tan -- OptimaNumerics
  2006-03-29  1:57       ` gslclapack James Bergstra
  0 siblings, 2 replies; 16+ messages in thread
From: Robert G. Brown @ 2006-03-28 13:43 UTC (permalink / raw)
  To: C J Kenneth Tan -- OptimaNumerics
  Cc: Brian Gough, James Bergstra, gsl-discuss

On Tue, 28 Mar 2006, C J Kenneth Tan -- OptimaNumerics wrote:

> Brian,
>
> On 2006-03-28 12:48 +0100 Brian Gough (bjg@network-theory.co.uk) wrote:
>
>> Date: Tue, 28 Mar 2006 12:48:30 +0100
>> From: Brian Gough <bjg@network-theory.co.uk>
>> To: James Bergstra <james.bergstra@umontreal.ca>
>> Cc: gsl-discuss <gsl-discuss@sources.redhat.com>
>> Subject: Re: gslclapack
>>
>> James Bergstra writes:
>> > On one hand, I'm jealous that lapack has a fancy routine, and react "let's
>> > implement that for gsl too" and on the other hand, I think "why is gsl
>> > re-implementing (what I believe is) a very standard library?"
>>
>> LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
>> a dependency on lapack would be a pain.
>
> How do you find LAPACK being a pain?  The public version of LAPACK has
> been well tested and the design has been well thought out.  It has a
> very comprehensive test suite also.

Is this an either/or proposition?  Having its own lapack implementation
just eliminates a dependency which I agree is desirable, and since the
system-provided LAPACK often sucks as far as efficiency goes it also
means that one CAN gradually start to tune up gsllapack. If the calls
are the same, though, presumably one could link into an alternative
library via suitable compile/link flags.  I thought this was the way
things were being done already, actually, so one could e.g. use ATLAS
BLAS:

  Linking with an alternative BLAS library

  The following command line shows how you would link the same application
  with an alternative CBLAS library called `libcblas',

  $ gcc example.o -lgsl -lcblas -lm

  For the best performance an optimized platform-specific CBLAS library
  should be used for -lcblas. The library must conform to the CBLAS
  standard. The ATLAS package provides a portable high-performance BLAS
  library with a CBLAS interface. It is free software and should be
  installed for any work requiring fast vector and matrix operations. The
  following command line will link with the ATLAS library and its CBLAS
  interface,

  $ gcc example.o -lgsl -lcblas -latlas -lm

So I don't see why there is an issue here.  LAPACK efficiency was, I
thought, "largely" inherited from BLAS although I'm sure there are
additional efficiencies one can realize if one works for them.  However,
as long as the API is the same, one could presumably do -lgsllapack vs.
-llapack as a compile/link-time choice.  This is moderately more complex
at compile time, but it is easy to document and a methodology that any
good programmer should be familiar with anyway.

> What's the price of performance?  This question can be phrased as
> what's the price of electricity and what's the price of server room
> space.

This is dead right, of course.  NOT having the ability to use optimized
libraries can cost you as much as a factor of 2-3 in the case of an
ATLAS-tuned BLAS, which for a linear algebra application can double the
productivity/dollar of hundreds of thousands of dollars of cluster
hardware (as Ken's sales talk so nicely shows:-).  One doesn't want
there to be a disincentive to the use of GSL in HPC so it can continue
to take over the world:-)

However, I don't see that this is an either/or proposition, and looking at:

> http://www.OptimaNumerics.com/docs/hpc-asia05/hpc-asia05.pdf .

... the important question isn't benchmarks against "competitor", it
is benchmarks against the specific case of GSL+ATLAS tuned BLAS and the
marginal advantages there, using the same compiler and support library
otherwise.  I'm more than curious to see what the marginal advantages
are of tuned LAPACK >>beyond<< those provided by using a good ATLAS
tuned BLAS and corresponding LAPACK.

Or if you like, a longer-term question that is quite worth bringing up
is to what extent it is desirable to introduce architecture-specific
tuning into actual GSL code, period.

On the one hand, it makes the code ugly and extremely difficult to
maintain and offers small marginal performance gains in most places BUT
linear algebra.  It tends to depend heavily on compiler as well -- if
you write code optimized for e.g.  presumed SSE1 or SSE2 instructions
your compiler has to support them, not just your CPU.  It makes the code
in a sense less portable.  It is "expensive" in human time, and developer
time in the GSL is probably largely donated and a precious resource.
I'd rather give up 20-30% in "potential" performance and use the GSL for
free, personally, blessing the developers and their contribution to
world civilization as I do so. (Thanks, guys!:-)

On the other hand, for some applications tuning CAN be a big, big
performance booster -- 2-3x -- and hence very valuable.  I'm working
(depending on whether and how much support I get) on a portable/open way
of writing at least moderately autotuning HPC code.  Something that will
get "most" of the benefit of a full ATLAS-like autotuning build without
architecture-specific instrumentation.  Don't hold your breath, but in a
year or three maybe this problem can be at least attacked without
playing the library juggling game.

In the meantime, hey, the GSL is open source, full GPL, viral.  That
means that as long as the LAPACK API is preserved -- or a wrapper of the
standard lapack calls provided -- you can ALWAYS hack and link your own
LAPACK in.  The effort in any event is WAY less than the effort required
to actually build an architecture-dependent LAPACK in the first place,
and you can freely market a "tuned" version of the GSL as long as you
provide its sources, and it seems to me that in the case of lapack this
is at MOST developing a wrapper.  I've done similar things already for
adding e.g. rng's to the GSL -- it isn't difficult.  Or am I missing
something?

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu



* Re: gslclapack
  2006-03-28 11:48 ` gslclapack Brian Gough
@ 2006-03-28 12:14   ` C J Kenneth Tan -- OptimaNumerics
  2006-03-28 13:43     ` gslclapack Robert G. Brown
  2006-03-29  0:33   ` gslclapack Gerard Jungman
  1 sibling, 1 reply; 16+ messages in thread
From: C J Kenneth Tan -- OptimaNumerics @ 2006-03-28 12:14 UTC (permalink / raw)
  To: Brian Gough; +Cc: James Bergstra, gsl-discuss

Brian,

On 2006-03-28 12:48 +0100 Brian Gough (bjg@network-theory.co.uk) wrote:

> Date: Tue, 28 Mar 2006 12:48:30 +0100
> From: Brian Gough <bjg@network-theory.co.uk>
> To: James Bergstra <james.bergstra@umontreal.ca>
> Cc: gsl-discuss <gsl-discuss@sources.redhat.com>
> Subject: Re: gslclapack
> 
> James Bergstra writes:
>  > On one hand, I'm jealous that lapack has a fancy routine, and react "let's
>  > implement that for gsl too" and on the other hand, I think "why is gsl
>  > re-implementing (what I believe is) a very standard library?"
> 
> LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
> a dependency on lapack would be a pain.

How do you find LAPACK being a pain?  The public version of LAPACK has
been well tested and the design has been well thought out.  It has a
very comprehensive test suite also. 

Having LAPACK in the back end would also enable GSL users to take
advantage of high-performance implementations of LAPACK.  We are able to
deliver a substantially better-performing LAPACK in the OptimaNumerics
Libraries (even when compared to commercial products).  Some benchmark
results are included here:
http://www.OptimaNumerics.com/docs/hpc-asia05/hpc-asia05.pdf .

What's the price of performance?  This question can be phrased as
what's the price of electricity and what's the price of server room
space.


Kenneth Tan
-----------------------------------------------------------------------
C J Kenneth Tan, PhD
OptimaNumerics Ltd                    Telephone: +44 798 941 7838
E-mail: cjtan@OptimaNumerics.com      Telephone: +44 207 099 4428
Web: http://www.OptimaNumerics.com    Facsimile: +44 207 100 4572
-----------------------------------------------------------------------


* Re: gslclapack
  2006-03-26  2:41 gslclapack James Bergstra
@ 2006-03-28 11:48 ` Brian Gough
  2006-03-28 12:14   ` gslclapack C J Kenneth Tan -- OptimaNumerics
  2006-03-29  0:33   ` gslclapack Gerard Jungman
  0 siblings, 2 replies; 16+ messages in thread
From: Brian Gough @ 2006-03-28 11:48 UTC (permalink / raw)
  To: James Bergstra; +Cc: gsl-discuss

James Bergstra writes:
 > On one hand, I'm jealous that lapack has a fancy routine, and react "let's
 > implement that for gsl too" and on the other hand, I think "why is gsl
 > re-implementing (what I believe is) a very standard library?"

LAPACK is huge & tedious to install.  GSL is meant to be easy to use -
a dependency on lapack would be a pain.


* gslclapack
@ 2006-03-26  2:41 James Bergstra
  2006-03-28 11:48 ` gslclapack Brian Gough
  0 siblings, 1 reply; 16+ messages in thread
From: James Bergstra @ 2006-03-26  2:41 UTC (permalink / raw)
  To: gsl-discuss


Please excuse a sort of continued email (further to my last comment on
Cholesky decomposition).  I notice that LAPACK implements the Cholesky
decomposition of real symmetric matrices in a file called dpotrf.f.  In
that file, there is apparently a 'blocked' version of the decomposition
algorithm that uses level-3 BLAS routines instead of level-2.
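
For readers who haven't looked at dpotrf.f: the blocking amounts to
factoring a small diagonal block with the unblocked algorithm and pushing
essentially all of the remaining work into level-3 BLAS calls.  Below is a
rough sketch of one such blocked variant (right-looking), written against
CBLAS for a row-major lower triangle; the function names, block size and
error handling are made up for illustration -- this is not the LAPACK code
itself.

  #include <stddef.h>
  #include <math.h>
  #include <gsl/gsl_cblas.h>

  /* unblocked Cholesky of the nb x nb block at 'a' (row-major, leading
     dimension lda); returns 0 on success, -1 if not positive definite */
  static int chol_unblocked(double *a, size_t nb, size_t lda)
  {
    for (size_t j = 0; j < nb; j++) {
      double d = a[j * lda + j];
      for (size_t k = 0; k < j; k++)
        d -= a[j * lda + k] * a[j * lda + k];
      if (d <= 0.0) return -1;
      d = sqrt(d);
      a[j * lda + j] = d;
      for (size_t i = j + 1; i < nb; i++) {
        double s = a[i * lda + j];
        for (size_t k = 0; k < j; k++)
          s -= a[i * lda + k] * a[j * lda + k];
        a[i * lda + j] = s / d;
      }
    }
    return 0;
  }

  /* blocked, right-looking Cholesky: A = L L^T, lower triangle only */
  int chol_blocked(double *a, size_t n, size_t lda)
  {
    const size_t NB = 64;                     /* illustrative block size */
    for (size_t k = 0; k < n; k += NB) {
      size_t nb = (n - k < NB) ? n - k : NB;
      double *akk = a + k * lda + k;

      if (chol_unblocked(akk, nb, lda) != 0)
        return -1;

      if (k + nb < n) {
        size_t m = n - k - nb;
        double *aik = a + (k + nb) * lda + k;          /* panel below Akk */
        double *aii = a + (k + nb) * lda + (k + nb);   /* trailing block */

        /* panel solve: A(i,k) := A(i,k) * L(k,k)^{-T}   (level-3) */
        cblas_dtrsm(CblasRowMajor, CblasRight, CblasLower, CblasTrans,
                    CblasNonUnit, (int) m, (int) nb, 1.0, akk, (int) lda,
                    aik, (int) lda);

        /* trailing update: A(i,i) := A(i,i) - A(i,k) * A(i,k)^T */
        cblas_dsyrk(CblasRowMajor, CblasLower, CblasNoTrans,
                    (int) m, (int) nb, -1.0, aik, (int) lda,
                    1.0, aii, (int) lda);
      }
    }
    return 0;
  }

Essentially all of the flops end up in dtrsm and dsyrk, which is why the
speed of a blocked dpotrf is inherited almost entirely from the BLAS.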

On one hand, I'm jealous that lapack has a fancy routine, and react "let's
implement that for gsl too" and on the other hand, I think "why is gsl
re-implementing (what I believe is) a very standard library?"

What would be the pros and cons of slowly migrating the number-crunching
from the gsl_linalg routines to gsl_clapack ones, while leaving the
argument-checking to the gsl_linalg ones?  I think that division works
well in the case of gsl_blas vs. gsl_cblas.
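
To make the proposed division concrete, a sketch of what such a split
might look like is below.  Nothing here exists: gsl_clapack_dpotrf and the
wrapper name are invented stand-ins for whatever a C LAPACK binding would
provide; the point is only that argument checking and error handling stay
on the GSL side while the numerics are resolved at link time, as with
gsl_blas/gsl_cblas.

  #include <gsl/gsl_matrix.h>
  #include <gsl/gsl_errno.h>

  /* hypothetical numerics layer, resolved at link time */
  int gsl_clapack_dpotrf(int rowmajor, char uplo, int n,
                         double *a, int lda);

  /* hypothetical GSL-side wrapper: checking here, crunching there */
  int gsl_linalg_cholesky_decomp_lapack(gsl_matrix *A)
  {
    if (A->size1 != A->size2)
      GSL_ERROR("Cholesky decomposition requires square matrix",
                GSL_ENOTSQR);

    if (gsl_clapack_dpotrf(1, 'L', (int) A->size1,
                           A->data, (int) A->tda) != 0)
      GSL_ERROR("matrix is not positive definite", GSL_EDOM);

    return GSL_SUCCESS;
  }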

My assumption through all this is that clapack offers more routines and
faster implementations than those of the GSL.  The main advantage I hope
for is that the LAPACK implementation is separate from the GSL, and can
be chosen at application link time.

-- 
james bergstra
http://www-etud.iro.umontreal.ca/~bergstrj


end of thread

Thread overview: 16+ messages
2006-03-29 23:14 gslclapack Gerard Jungman
2006-03-30 13:21 ` gslclapack Robert G. Brown
  -- strict thread matches above, loose matches on Subject: below --
2006-03-26  2:41 gslclapack James Bergstra
2006-03-28 11:48 ` gslclapack Brian Gough
2006-03-28 12:14   ` gslclapack C J Kenneth Tan -- OptimaNumerics
2006-03-28 13:43     ` gslclapack Robert G. Brown
2006-03-28 23:38       ` gslclapack C J Kenneth Tan -- OptimaNumerics
2006-03-29  1:20         ` gslclapack Robert G. Brown
2006-03-29  1:57       ` gslclapack James Bergstra
2006-03-29  0:33   ` gslclapack Gerard Jungman
2006-03-29 17:45     ` gslclapack Linas Vepstas
2006-03-29 18:04       ` gslclapack J.J.Green
2006-03-29 18:51         ` gslclapack James Bergstra
2006-03-29 23:04       ` gslclapack Gerard Jungman
2006-03-31 23:19     ` gslclapack Brian Gough
2006-04-03 19:47       ` gslclapack Gerard Jungman
