From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gsl-discuss-return-5110-listarch-gsl-discuss=sources.redhat.com@sourceware.org>
Received: (qmail 4311 invoked by alias); 7 Jun 2007 14:17:28 -0000
Received: (qmail 4297 invoked by uid 22791); 7 Jun 2007 14:17:26 -0000
X-Spam-Check-By: sourceware.org
Received: from mail.phy.duke.edu (HELO mail.phy.duke.edu) (152.3.182.2)
    by sourceware.org (qpsmtpd/0.31) with ESMTP; Thu, 07 Jun 2007 14:17:22 +0000
Received: from localhost (localhost.localdomain [127.0.0.1])
	by mail.phy.duke.edu (Postfix) with ESMTP id 2E95F1508AD;
	Thu,  7 Jun 2007 10:17:18 -0400 (EDT)
Received: from mail.phy.duke.edu ([127.0.0.1])
	by localhost (mail.phy.duke.edu [127.0.0.1]) (amavisd-new, port 10026)
	with LMTP id 70VVD78WMKFd; Thu,  7 Jun 2007 10:17:18 -0400 (EDT)
Received: from lilith.rgb.private.net (client212-5.dsl.intrex.net [209.42.212.5])
	by mail.phy.duke.edu (Postfix) with ESMTP id 50FC61500C0;
	Thu,  7 Jun 2007 10:17:16 -0400 (EDT)
Date: Thu, 07 Jun 2007 14:17:00 -0000
From: "Robert G. Brown" <rgb@phy.duke.edu>
To: Michael <comtech.usa@gmail.com>
cc: help-gsl@gnu.org, gsl-discuss@sources.redhat.com
Subject: Re: Will I get speedup by providing Jacobian to the GSL ODE45 solver?
In-Reply-To: <b1f16d9d0706061654o441843f1q8fe2d902b5743c72@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0706070921150.8569@lilith.rgb.private.net>
References: <b1f16d9d0706061654o441843f1q8fe2d902b5743c72@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Mailing-List: contact gsl-discuss-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gsl-discuss.sourceware.org>
List-Subscribe: <mailto:gsl-discuss-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gsl-discuss/>
List-Post: <mailto:gsl-discuss@sourceware.org>
List-Help: <mailto:gsl-discuss-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gsl-discuss-owner@sourceware.org
X-SW-Source: 2007-q2/txt/msg00037.txt.bz2

On Wed, 6 Jun 2007, Michael wrote:

> Will I get speedup by providing Jacobian to the GSL ODE45 solver?
>
> Hi all,
>
> I am a newbie to GSL ODE solvers. I have a critical need for speed. I
> am hunting for ways to speed up. My questions are:
>
> 1. If after a comparison of GSL ODE solvers, I conclude that ODE45 is
> sufficient for my need in terms of accuracy. (And in fact I turned to
> GSL ODE solver from Matlab, and I have already done a comparison
> analysis using all the ODE solvers in Matlab. ) And I remember I don't
> provide Jacobian to ODE45 solver. And now will I get speed-up at the
> fixed accuracy level by providing the Jacobian to the GSL ODE45
> solver?
>
> 2. Is it possible that after providing the Jacobian, other solvers use
> the Jacobian more efficiently and start to outform ODE45 in terms of
> speed or accuracy?
>
> I am still experimenting these things, but I would like to hear your
> comments too...
>
> Thanks for your help!

I'll reply in just one place.  There are lots of things to answer from
your previous messages, so forgive me/remind me if I forget one.  I'll
have to make this (uncharacteristically, for me:-) brief as I have some
other work to do today, so bear with me.

ODE45 in matlab is probably rkf45 in GSL and it is an excellent general
purpose ODE solver, I agree.  But, as I already noted in a previous
message, some of the alternatives sometimes turn out to be faster and
just as accurate, so I'd still recommend that you experiment with this
if speed is THE defining goal here.  If it is, the it is worth the time
required to prototype and run the code for a day or two to tune it up,
and it is trivial to swap ODE solvers with the GSL, really.  At most you
have to write the Jacobian code.

Don't be "scared" of doing this because you don't know what a Jacobian
is (not unlikely, actually:-).  GIYF:

  http://en.wikipedia.org/wiki/Jacobian

It is just the matrix of partial derivatives that show how all of the
variables in the vector being solved depend on (in the case of the ODE
solver) each other (and in the more general case, on some underlying set
of "spatial" coordinates).  The van der pol oscillator example should
make this clear.  However, if you are solving (say) single variable ODEs
and make your vector "system" out of N independent 1-D ODES, each with
the same derivative, then the Jacobian matrix is diagonal and trivial --
basically the unit matrix.

OK, now, to speed things up.  Speed costs money, so you have to decide
how "valuable" the results are to you (what you want to spend) and then
optimize within that expenditure limit.  You can also spend "nothing"
and optimize on whatever hardware you have, but you can't do this AND
insist that you be able to solve umpty-zillion ODE instances in finite
time.  Reality is real, and you have to live within the physical
limitations of your hardware or spend more on better/more hardware.

There are three distinct dimensions you can optimize:

  * Number of CPUs.  You are fortunate in that your problem decomposes
into embarrassingly parallel subproblems trivially.  This means you can
basically do your work twice as fast by using two systems at the same
time, N times as fast by using N systems (less a hair for overhead).  So
one way to get linear speedup is to buy more systems and make a small
beowulf cluster to do your work.  If you are interested in this
approach, check out:

   http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php

to learn about parallel scaling and getting started with beowulfery.
Also look at:

   http://www.clustermonkey.net

where there are all sorts of articles on this, including a whole series
by yours truly aimed at newbies.  Note well that this NEED NOT BE VERY
EXPENSIVE!  These days $3000 will buy you 8 64-bit opteron cores with at
least 512 MB per core in only two boxes, and on an embarrassingly
parallel problem that is a HUGE amount of compute capacity!

  * Quality of CPU, memory bus, etc.  Cheap, single core, small cache
32-bit CPU cores on a slow plodgy memory subsystem will not perform as
well as a bleeding edge 64-bit cores on a fast memory subsystem for
problems of this type.  If you're currently working on a 1.6 GHz Celeron
and move your code over to a 3 GHz 64-bit CPU, it will speed up by maybe
four right there.  If you move it to a dual core CPU that's eight.  If
you move it to a dual dual core system, that is as much as 16x speedup
right there per SINGLE system.

There are price point nonlinearities in here you should be aware of.
Your problem is almost certainly CPU bound -- unlikely to depend
strongly on anything but CPU clock and maybe memory bus clock and
architecture when you have enough cores hitting the same memory.
Bleeding edge CPUs are MUCH more expensive than ones only 10% to 20%
slower that aren't quite so advanced.  Since your problem decomposes,
you should multiply the number of CPU cores you can afford per
architecture by the system clock of those cores and optimize the
resulting number, remembering tha AMD cores are typically faster at the
same clock than Intel cores but YMMV and it's up to you to prototype
and compare, not me.

   * Writing your code to take advantage of vectorization and CPU cache,
etc.

This latter is tricky.  I know that you want me to show you an example
of a vectorized solution, but I'd have to write out a full answer and
test it (just like you will) and I don't have time to, so you'll have to
try it on your own.  I'll describe "how" to proceed, and assume/hope
that you can manage the actual coding.

First of all, it isn't completely clear that you are better off
vectorizing the independent ODEs to be solved vs calling the ODE solver
one at a time.  The trade off involved is the overhead of starting up
the ODE solver (a fixed cost per invocation per solution) vs the speedup
advantage of executing all of the core loop in cache while working on
register memory.  At some point increasing the size of the problem
forces the system out of cache and maybe even out of being relatively
local in memory (unlikely, actually, if your problem decomposes).  This
nonlinearly slows it down to a new scaling regime.  However, if you
really only evaluate the solution for four or five points per run, the
overhead of startup can be a significant fraction of the time of
solution (or not -- something to check out) and you may be better off
vectorizing a number of solutions into a single startup call.

In other words, there is LIKELY to be an optimum that looks like the
following:

   a) Put each independent ODE to be solved into a component of a vector
y[].  Make the length of y a variable so you can adjust it at run time.

   b) Write code that systematically assigns your parameter space to the
vector of initial conditions or parameters passed to the ODE solver or
in a global parameter vector -- it doesn't matter how, but it has to be
systematic and automated.

   c) Write code that evaluates the vector of deriviatives -- if all that
changes is the parameter you can reuse the actual derivative code in a
trivial loop.

   d) Write code that makes the jacobian a unit matrix -- no y[] depends
on any other y[] and \partial y_i/\partial y_i = 1.  Use the GSL example
as a template, but your code will be much simpler on the right hand
side of the assignments.

   e) Test it for (say) rkf45, why not.  Make sure that it gives you the
same answers you were getting in matlab, that the evolution routine
works flawlessly to each of the desired times, etc.

Now experiment.  Try running it varying N and plot out the average time
required to complete per solution.  Try varying ODE solvers and repeat.
Try varying architecture (if you have more than one kind of system to
use to run it on).

If god is good to you, you will rapidly determine that the code has one
of the three generic limiting forms:

   * It will run fastest with N=1, so you are best off running a loop to
run the ODE solver N times (on as many hosts/cores as you can manage to
split up the work) for 1 problem at a time.

   * It will run fastest with the largest possible N that will fit the
program comfortably in memory -- maybe N = 2^20, maybe N = 2^28 (if you
have a lot of memory -- GB+ -- to run in).

   * It will run fastest somewhere in between.

where the answers WILL vary per architecture tested and maybe per ODE
tested.

I don't even know how I would bet between these three.  I suppose I'd
"expect" an optimum somewhere in the ballpark of N = 32K -- small enough
to keep the code in L1 and data in L2, big enough to spread the startup
overhead out over a "large" number of calls, but modern memory
subsystems could easily perform well enough to make the second one true
or if you carefully write the startup code so it avoids reallocating the
ODE structs from run to run and use register variables throughout might
make the first one true.  Just carefully coding can move the optimum
around -- it is really another degree of freedom.

I personally would advise you not to go the assembler route unless it is
your last resort.  C compilers are actually pretty efficient and you'd
have to work long and hard to get a really significant speedup out of it
compared to what you could do by just careful C coding.  Indeed, I'd try
rewriting the code in C first to maybe leave out some of the GSL-added
overhead of testing for successful completions and the like -- at your
own considerable risk -- and vectorizing the core loops in that way.

But I actually wouldn't try that -- not for a parallelizable problem.
Code portability and readability and reliability matter, and using the
GSL is a good idea for all of these reasons.  Hardware is cheaper than
human time.  Buy some and throw it at the problem, accepting that you
can't squeeze blood from turnips and you can't get 2^25 solutions per
second out of hardware capable of generating only 2^20, but that if you
add 2^5 CPU cores (a mere 32 processors) you can...

HTH,

     rgb

--


-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu